(faster session reinitialization) Remove timeout for reestablishing new connection in case of read-only table error #42323
Conversation
This test is based on sleep, so I will just see how flaky it is.
    {
        storage.replica_is_active_node = nullptr;
    }
    catch (...)
This looks unnecessary since ~EphemeralNodeHolder already catches all exceptions.
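For context, the point being made is that a RAII holder whose destructor already swallows exceptions makes an extra try/catch around it redundant. A minimal sketch of that pattern (a hypothetical, simplified analogue of `EphemeralNodeHolder`, not the real ClickHouse class):

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Simplified stand-in for EphemeralNodeHolder: the destructor attempts
// cleanup but never lets an exception escape, because throwing from a
// destructor would call std::terminate during stack unwinding.
class EphemeralNodeHolder
{
public:
    explicit EphemeralNodeHolder(std::function<void()> remove_node_)
        : remove_node(std::move(remove_node_)) {}

    ~EphemeralNodeHolder()
    {
        try
        {
            remove_node();  // may throw, e.g. if the ZooKeeper session is gone
        }
        catch (...)
        {
            // Swallow the exception; the real code logs it instead.
        }
    }

private:
    std::function<void()> remove_node;
};
```

Because the destructor is exception-safe on its own, callers resetting such a holder do not need their own `catch (...)`, which is the redundancy pointed out above.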
Indeed. It seems the issue is not here, as was described in the issue.
@Algunenano As was mentioned above, EphemeralNodeHolder doesn't throw any exception outside; it just catches it and prints it to the log. I checked, and it seems no functions that we execute during partial shutdown/initialization throw any exceptions. Nevertheless, I removed even the 1s timeout before rechecking the health and replica state, and I rewrote the code a bit to make it more reliable, in my opinion.
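The change described above, retrying reconnection immediately instead of sleeping between attempts, can be sketched as follows. This is a hypothetical simplification, not the actual MergeTreeRestartingThread code; `tryRestartReplica` here is a stand-in that fails a configurable number of times before succeeding:

```cpp
#include <cassert>

// Stand-in for the real partial-shutdown + re-initialization attempt;
// pretend the first few calls fail because ZooKeeper is still unreachable.
bool tryRestartReplica(int & attempts_left)
{
    return --attempts_left <= 0;
}

// Retry without any sleep between iterations: as soon as an attempt
// fails, the next one starts. Returns how many attempts were needed.
int reconnectWithoutTimeout(int failures_before_success)
{
    int attempts_left = failures_before_success;
    int attempts_made = 0;
    while (true)
    {
        ++attempts_made;
        if (tryRestartReplica(attempts_left))
            return attempts_made;  // connected; no timeout was spent waiting
    }
}
```

The old behaviour would insert a fixed sleep between iterations of such a loop, which is where the reconnection delay came from.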
I've tested this PR by having 1 insert per second on a replicated table and restarting ZK, and it still takes ~30 seconds to recover. The same operation takes ~10 seconds on 21.8. With the PR it goes from 19:04:38 to 19:05:10 (32 seconds still). My feeling is that this time should be much lower: not just lower than the 30+ seconds in master, but lower than the 10 seconds of 21.8. I'll try (tomorrow) with a ZK cluster instead of a standalone one, where the recovery should be immediate, but I think that failed operations should influence how often the check happens. Not sure if the query itself should try to recover the ZK connection (which would be nice if possible) or at least move the check up to a reasonable time (so it doesn't happen a lot, but maybe once a second). I need to think more about it.
The only idea I have is that the reconnection slowness happens because of some timeouts inside ZooKeeper (like the timeout for deleting ephemeral nodes, etc.).
It seems that there is another problem, as initializing the replica is taking ~20 seconds. I'm investigating more to see what's going on.
Well, narrowing it down was easy. It's being spent in waitForEphemeralToDisappearIfAny. I guess that since we couldn't remove the lock (as there isn't a connection to ZK), it finds it and waits for it to be removed automatically by ZK (it's an ephemeral node).
Following up:
I'm surprised about the change in that PR, because it's set up to throw a LOGICAL_ERROR if the node takes too long to disappear, which doesn't seem like the right exception to me, considering I was expecting this to handle configuration issues (where a user creates several servers with the same config by mistake). While this behaviour might be desirable in some cases vs. the removal of the node, I think that in cases of a lost session it's doing a disservice. It'd make sense to keep a flag of "unclean shutdown" and remove the old node if necessary, but I hope there is a simpler/cleaner way.
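The wait behaviour being debated can be sketched as a simple polling loop. This is a hypothetical simplification of what `waitForEphemeralToDisappearIfAny` does (the real function uses ZooKeeper watches and throws on timeout; here the timeout case just returns `false`):

```cpp
#include <cassert>
#include <functional>

// Poll until the old ephemeral node disappears, giving up after a fixed
// number of checks. In the discussion above, the real code throws
// LOGICAL_ERROR on timeout instead of returning false.
bool waitForEphemeralToDisappear(std::function<bool()> node_exists, int max_polls)
{
    for (int i = 0; i < max_polls; ++i)
    {
        if (!node_exists())
            return true;   // node gone; the replica can register itself
    }
    return false;          // timed out waiting for ZooKeeper to expire it
}
```

When the session was lost uncleanly, the node only disappears once ZooKeeper's session timeout expires, which is the multi-second recovery delay observed above.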
Well, we already discussed internally removing this code (and bringing back explicit deletion of this node on start). AFAIK we found a couple of bugs, so LOGICAL_ERROR is a bit of an overkill, but it forces us to look at any errors.
I'm currently working on going back to deletion of the node if the storage was
Probably we can just bring back the old code, but make it smarter, e.g. write the UUID of the server to ZooKeeper and then check it before deleting the node.
Yes, it's not a logical error actually, and we can change the error code. But usually 3x the session timeout should be enough for the ephemeral node to get removed. And we want to check it in CI to catch issues like #28519 and #41215.
OK, maybe we can resurrect the code that removes the node, but we should do it carefully, because ephemeral nodes are a kind of distributed lock, and ideologically it's not quite correct to forcefully remove them (we cannot unlock a lock if it's held by someone else). But yes, adding
I think it's not necessary to change the node identifier, as it's random and it won't change because of a table restart. We can have logic like:
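A minimal sketch of the UUID-check idea proposed above: only force-delete the stale ephemeral node when it was created by this very server. The function name, parameters, and shape here are hypothetical illustrations, not the final implementation:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Decide whether the leftover ephemeral node may be safely removed.
// - No node: nothing to do.
// - Node owned by our own server UUID: it is our leftover from a lost
//   session, safe to delete and re-register immediately.
// - Node owned by another UUID: it is another replica's lock (e.g. a
//   misconfigured duplicate server); wait for ZooKeeper to expire it.
bool shouldRemoveStaleNode(const std::optional<std::string> & node_owner_uuid,
                           const std::string & our_server_uuid)
{
    return node_owner_uuid.has_value() && *node_owner_uuid == our_server_uuid;
}
```

This keeps the distributed-lock semantics discussed above: a node held by someone else is never forcefully unlocked.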
I've set up a local 3-server Keeper cluster to do tests closer to a production environment, as before I was testing with a single node, which isn't realistic or advisable. With this PR and the ephemeral node fix (delete it if it's the same instead of waiting for it to go away), the time to recovery is less than a second. I'm testing the time to recovery by ingesting into the same table over and over every 1 second and restarting Keeper servers randomly. If I restart a follower, nothing happens; if I restart the leader, a re-election happens and an insert fails, but the next one, just one second later, works fine. I'll polish things up and create a PR.
PR here: #42541
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
When the ZooKeeper session expires, the MergeTreeRestartingThread is notified automatically and almost instantly, but it needs to try several times to establish a new connection. This PR enables retries without any timeout after the first error occurs. This closes #42251.