
(faster session reinitialization) Remove timeout for reestablishing new connection in case of read-only table error#42323

Closed
nikitamikhaylov wants to merge 9 commits intomasterfrom
restaring-thread-investigation

Conversation

@nikitamikhaylov (Member) commented Oct 14, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

When the ZooKeeper session expires, the ReplicatedMergeTreeRestartingThread is notified automatically and almost instantly, but it needs several attempts to establish a new connection. This PR enables retries without any timeout after the first error occurs. This closes #42251
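
The retry behaviour described above can be sketched as follows. This is an illustrative Python sketch, not ClickHouse code; the function name `run_with_retries` and its parameters are hypothetical. The point is that with a zero retry delay the attempts happen back-to-back instead of waiting a fixed timeout between them:

```python
import time

def run_with_retries(attempt_init, max_tries=10, retry_delay_s=0.0):
    """Retry replica (re)initialization until it succeeds.

    With retry_delay_s=0.0 the retries happen back-to-back rather than
    sleeping a fixed timeout after each failure, which is the behaviour
    this PR aims for after the first error.
    """
    for n in range(1, max_tries + 1):
        if attempt_init():
            return n  # number of attempts it took
        time.sleep(retry_delay_s)
    raise RuntimeError("could not reinitialize the replica")
```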

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Oct 14, 2022
@nikitamikhaylov (Member Author)

This test is based on sleep, so I will just see how flaky it is.

try
{
    storage.replica_is_active_node = nullptr;
}
catch (...)
@devcrafter (Member) commented Oct 15, 2022

This looks unnecessary, since ~EphemeralNodeHolder already catches all exceptions.
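
For readers unfamiliar with the pattern being discussed: the holder's cleanup swallows every exception itself, which is why an outer try/catch around it is redundant. A minimal Python sketch of the idea (illustrative only; the real EphemeralNodeHolder is C++ and its destructor logs the error via ClickHouse's logging machinery):

```python
class EphemeralNodeHolder:
    """Owns an ephemeral ZooKeeper node and removes it on release.

    release() swallows every exception (as the real destructor does),
    so callers do not need a try/catch of their own around it.
    """

    def __init__(self, client, path):
        self.client = client
        self.path = path
        self.client.create(path, ephemeral=True)

    def release(self):
        try:
            self.client.delete(self.path)
        except Exception:
            pass  # the real code logs the error; it never propagates
```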

@nikitamikhaylov (Member Author)

Indeed. It seems the problem is not here, contrary to what was described in the issue.

@alexey-milovidov alexey-milovidov changed the title Ignore the exception about failed node deletion from Zookeeper (faster session reinitialization) Ignore the exception about failed node deletion from Zookeeper Oct 17, 2022
@alexey-milovidov alexey-milovidov self-assigned this Oct 17, 2022
@nikitamikhaylov nikitamikhaylov changed the title (faster session reinitialization) Ignore the exception about failed node deletion from Zookeeper (faster session reinitialization) Remove timeout for reestablishing new connection in case of read-only table error Oct 17, 2022
@nikitamikhaylov (Member Author)

@Algunenano As mentioned above, EphemeralNodeHolder doesn't propagate any exception; it just catches it and writes it to the log. I checked, and it seems none of the functions we execute during partial shutdown/initialization throw exceptions. Nevertheless, I removed even the 1s timeout before rechecking the health and replica state, and I rewrote the code a bit to make it more reliable, in my opinion.

@Algunenano (Member)

I've tested this PR by doing 1 insert per second on a replicated table and restarting ZK, and it still takes ~30 seconds to recover. The same operation takes ~10 seconds on 21.8:

$ while true; do date; ch_client_prod-01 --query "Insert into t1 values (23)"; sleep 1; done 
lun 17 oct 2022 19:09:53 CEST
lun 17 oct 2022 19:09:54 CEST
lun 17 oct 2022 19:09:55 CEST
lun 17 oct 2022 19:09:56 CEST
lun 17 oct 2022 19:09:57 CEST
lun 17 oct 2022 19:09:58 CEST
lun 17 oct 2022 19:09:59 CEST
lun 17 oct 2022 19:10:00 CEST
Received exception from server (version 21.8.14):
Code: 999. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Cannot allocate block number in ZooKeeper: Coordination::Exception: Connection loss. (KEEPER_EXCEPTION)
(query: Insert into t1 values (23))
lun 17 oct 2022 19:10:01 CEST
Received exception from server (version 21.8.14):
Code: 242. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Table is in readonly mode (zookeeper path: /clickhouse/tables/01-01/default.t1). (TABLE_IS_READ_ONLY)
(query: Insert into t1 values (23))
lun 17 oct 2022 19:10:02 CEST
Received exception from server (version 21.8.14):
Code: 242. DB::Exception: Received from clickhouse-01:49000. DB::Exception: Table is in readonly mode (zookeeper path: /clickhouse/tables/01-01/default.t1). (TABLE_IS_READ_ONLY)
(query: Insert into t1 values (23))
[the same TABLE_IS_READ_ONLY exception repeats once per second through 19:10:09]
lun 17 oct 2022 19:10:10 CEST

With the PR it goes from 19:04:38 to 19:05:10 (still 32 seconds).

My feeling is that this time should be much lower: not just lower than the 30+ seconds in master, but lower than the 10 seconds of 21.8. I'll try tomorrow with a ZK cluster instead of a standalone node, where recovery should be immediate, but I think failed operations should influence how often the check happens. I'm not sure whether the query itself should try to recover the ZK connection (which would be nice if possible) or at least bring the next check forward to a reasonable time (so it doesn't run constantly, but maybe once a second). I need to think more about it.

@nikitamikhaylov (Member Author)

The only idea I have is that the reconnection slowness happens because of some timeouts inside ZooKeeper (like the timeout for deleting ephemeral nodes, etc.).

@Algunenano (Member)

It seems there is another problem, as initializing the replica is taking ~20 seconds:

oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853405 [ 357376 ] {} <Trace> ZooKeeperClient: Failed to get API version
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853442 [ 357376 ] {} <Trace> ZooKeeper: Initialized, hosts: 127.0.0.1:2181
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853452 [ 357376 ] {} <Debug> Context: Establishing a new connection with ZooKeeper took 10121 ms
oct 18 17:59:49 Mordor clickhouse[357188]: 2022.10.18 15:59:49.853461 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeRestartingThread): Trying to start replica up
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539064 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Initializing parts in queue
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539927 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Queue initialized
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.539938 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/01-01/default.t1/replicas/clickhouse-01/queue
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.540736 [ 357376 ] {} <Debug> default.t1 (ReplicatedMergeTreeQueue): Having 0 queue entries to load, 0 entries already loaded.
oct 18 18:00:11 Mordor clickhouse[357188]: 2022.10.18 16:00:11.541543 [ 357376 ] {} <Trace> default.t1 (ReplicatedMergeTreeQueue): Loaded queue

I'm investigating more to see what's going on.

@Algunenano (Member)

Well, narrowing it down was easy: the time is being spent in waitForEphemeralToDisappearIfAny.

I guess that since we couldn't remove the lock (as there was no connection to ZK), it finds the node and waits for it to be removed automatically by ZK (it's an EPHEMERAL_SEQUENTIAL node, which is removed on the client's disconnect). I'll dig deeper to understand whether this is something that should be handled better in the Keeper configuration, or whether there needs to be a better way to deal with it on reconnection.
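
The waiting behaviour being measured here amounts to a poll-until-gone loop with a deadline. A minimal Python sketch of the idea (illustrative only; the real waitForEphemeralToDisappearIfAny is C++ and uses ZooKeeper watches rather than polling):

```python
import time

def wait_for_ephemeral_to_disappear(node_exists, timeout_s, poll_s=0.01):
    """Wait until a stale ephemeral node is gone.

    Returns True once node_exists() reports the node has disappeared,
    or False if the deadline expires first. This models why recovery
    stalls: nobody deletes the node, so we sit here until ZK expires it.
    """
    deadline = time.monotonic() + timeout_s
    while node_exists():
        if time.monotonic() >= deadline:
            return False  # the real code raises here instead
        time.sleep(poll_s)
    return True
```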

@Algunenano (Member)

Following up:

I'm surprised about the change in that PR, because it's set up to throw a LOGICAL_ERROR if the node takes too long to disappear, which doesn't seem like the right exception to me: I was expecting this to handle configuration issues (where a user creates several servers with the same config by mistake). While this behaviour might be desirable in some cases versus removing the node, I think that in the case of a lost session it's doing a disservice. It'd make sense to keep an "unclean shutdown" flag and remove the old node if necessary, but I hope there is a simpler/cleaner way.

@nikitamikhaylov (Member Author)

Well, we already discussed internally removing this code (and bringing back explicit deletion of this node on start). AFAIK we found a couple of bugs, so LOGICAL_ERROR is a bit of an overkill, but it forces us to look at any errors.

@Algunenano (Member)

I'm currently working on going back to deleting the node if partial_shutdown_called was set on the storage. After that I also want to look at why it takes 10 seconds to get a new ZK session. If you are working on that, I'll work on something else; otherwise I'll prepare a PR with all the changes once I have them.

@nikitamikhaylov (Member Author)

Probably we can just bring back the old code but make it smarter, e.g. write the server's UUID to ZooKeeper and check it before deleting the node.

@tavplubix (Member)

I'm surprised about the change in that PR because it's set up to throw a LOGICAL_ERROR if the node takes too long to disappear, which doesn't seem the right exception to me

Yes, it's not actually a logical error, and we can change the error code. But usually 3x the session timeout should be enough for the ephemeral node to get removed, and we want to check that in CI to catch issues like #28519 and #41215.

I'm currently working on going back to deletion of the node if the storage was partial_shutdown_called.

OK, maybe we can resurrect the code that removes the node, but we should do it carefully, because ephemeral nodes are a kind of distributed lock, and ideologically it's not quite correct to forcefully remove them (we cannot unlock a lock held by someone else). But yes, adding ServerUUID::get() to generateActiveNodeIdentifier() may help ensure that it's our node.

@Algunenano (Member)

adding ServerUUID::get() to generateActiveNodeIdentifier() may help to ensure that it's our node.

I think it's not necessary to change the node identifier, as it's random and won't change because of a table restart. We can have logic like:

  • If the node doesn't exist, ok and continue.
  • If the node exists and the id matches ours, delete it and recreate it in the new session.
  • If the node exists and the id doesn't match (likely an unclean shutdown), wait?
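
The three cases above can be sketched as follows. This is an illustrative Python sketch under assumed callbacks (`get_node`, `delete_node`, `wait_for_gone` are hypothetical names, not the real ClickHouse API):

```python
def handle_stale_active_node(get_node, delete_node, wait_for_gone, our_id):
    """Decide what to do with a leftover is_active node on restart.

    get_node() returns the node's identifier data, or None if it does
    not exist; delete_node() removes it; wait_for_gone() blocks until
    ZK expires the previous session's ephemeral node.
    """
    data = get_node()
    if data is None:
        return "absent"   # nothing to do, continue startup
    if data == our_id:
        delete_node()     # our own leftover: remove it and recreate it
        return "deleted"
    wait_for_gone()       # held by someone else (unclean shutdown?): wait
    return "waited"
```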

@Algunenano (Member)

I've set up a local 3-server Keeper cluster to run tests closer to a production environment; before, I was testing with a single node, which isn't realistic or advisable. With this PR and the ephemeral node fix (delete the node if it's ours instead of waiting for it to go away), the time to recovery is less than a second.

I'm testing the time to recovery by inserting into the same table over and over, every 1 second, and restarting Keeper servers randomly. If I restart a follower, nothing happens; if I restart the leader, a re-election happens and one insert fails, but the next one, just one second later, works fine. I'll polish things up and create a PR.

@Algunenano (Member)

PR here: #42541




Development

Successfully merging this pull request may close these issues.

Replicated table takes too long to recover after ZK session expired

6 participants