Replicated table takes too long to recover after ZK session expired #42251

@Algunenano

Description

@Algunenano

This is running 22.8.5.29, but master/HEAD has the same code.

When looking at ingestion errors in a cluster, we've noticed that recovering from ZK errors (a leader change, for example) is extremely slow: some tables may take 30 or 40 seconds to recover.

From the logs for the table d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd, I can see that at 07:31:05 it lost the connection with ZK, so the table was set to read-only:

2022.10.07 07:31:00.547839 [ 17483 ] {60aff979-f4d2-4aaa-a0af-f979f4d22aaa::202210_544697_545024_100} <Information> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): The part /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.11/parts/202210_544697_545024_100 on a replica suddenly appeared, will recheck checksums
2022.10.07 07:31:01.489333 [ 17479 ] {60aff979-f4d2-4aaa-a0af-f979f4d22aaa::202210_544697_545025_101} <Information> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): The part /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.7/parts/202210_544697_545025_101 on a replica suddenly appeared, will recheck checksums
2022.10.07 07:31:03.007135 [ 17469 ] {60aff979-f4d2-4aaa-a0af-f979f4d22aaa::202210_544697_545027_103} <Information> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): The part /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.11/parts/202210_544697_545027_103 on a replica suddenly appeared, will recheck checksums
2022.10.07 07:31:05.510536 [ 17468 ] {60aff979-f4d2-4aaa-a0af-f979f4d22aaa::202210_544697_545030_106} <Information> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): The part /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.11/parts/202210_544697_545030_106 on a replica suddenly appeared, will recheck checksums
2022.10.07 07:31:07.843725 [ 17621 ] {} <Error> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): void DB::StorageReplicatedMergeTree::queueUpdatingTask(): Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
2022.10.07 07:31:07.843932 [ 17549 ] {} <Error> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): void DB::StorageReplicatedMergeTree::mutationsUpdatingTask(): Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
2022.10.07 07:31:07.882759 [ 17539 ] {} <Warning> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (ReplicatedMergeTreeRestartingThread): ZooKeeper session has expired. Switching to a new session.
2022.10.07 07:31:07.890731 [ 17539 ] {} <Error> zkutil::EphemeralNodeHolder::~EphemeralNodeHolder(): Cannot remove /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.6/is_active: : Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
2022.10.07 07:31:08.114421 [ 19476 ] {d427351c-802d-4519-8c39-586649487b14} <Error> executeQuery: Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.6

Then we get something like 25 seconds of constant errors during ingestion:

`2022.10.07 07:31:08.114421 [ 19476 ] {d427351c-802d-4519-8c39-586649487b14} <Error> executeQuery: Code: 242. DB::Exception: Table is in readonly mode (replica path: /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.6). (TABLE_IS_READ_ONLY) (version 22.8.5.29 (official build)) (from 10.150.0.2:42676) (in query: INSERT INTO d_10d4a5.t_6d6e580a046240a589711df47ee79a4a(COLUMN LIST) FORMAT RowBinary ), Stack trace (when copying this message, always include the lines below):`

And finally, after ~35 seconds of "downtime", the recovery:

2022.10.07 07:31:40.092680 [ 17537 ] {} <Warning> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (ReplicatedMergeTreeRestartingThread): Table was in readonly mode. Will try to activate it.
2022.10.07 07:31:42.153350 [ 17459 ] {60aff979-f4d2-4aaa-a0af-f979f4d22aaa::202210_544697_545032_108} <Information> d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd (60aff979-f4d2-4aaa-a0af-f979f4d22aaa): The part /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.11/parts/202210_544697_545032_108 on a replica suddenly appeared, will recheck checksums

The restarting thread is set up to retry every 10 seconds (reduced to 1 second in current master), BUT the first run throws, so the task is not re-added:

2022.10.07 07:31:07.890731 [ 17539 ] {} <Error> zkutil::EphemeralNodeHolder::~EphemeralNodeHolder(): Cannot remove /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.6/is_active: : Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

And the full exception:

2022.10.07 07:31:07.890731 [ 17539 ] {} <Error> zkutil::EphemeralNodeHolder::~EphemeralNodeHolder(): Cannot remove /clickhouse/tables/01-01/d_10d4a5.t_4819fcef5b0a46eab5131a98330a5edd/replicas/10.150.0.6/is_active: : Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa3c597a in /usr/bin/clickhouse
1. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0x160ab4d6 in /usr/bin/clickhouse
2. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error) @ 0x160ab75c in /usr/bin/clickhouse
3. Coordination::ZooKeeper::pushRequest(Coordination::ZooKeeper::RequestInfo&&) @ 0x160fa2d6 in /usr/bin/clickhouse
4. Coordination::ZooKeeper::remove(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::function<void (Coordination::RemoveResponse const&)>) @ 0x160fb39f in /usr/bin/clickhouse
5. zkutil::ZooKeeper::asyncTryRemoveNoThrow(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x160b5d59 in /usr/bin/clickhouse
6. zkutil::ZooKeeper::removeImpl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x160b59c5 in /usr/bin/clickhouse
7. zkutil::ZooKeeper::tryRemove(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x160b5f4d in /usr/bin/clickhouse
8. zkutil::EphemeralNodeHolder::~EphemeralNodeHolder() @ 0xa579fc1 in /usr/bin/clickhouse
9. DB::ReplicatedMergeTreeRestartingThread::partialShutdown(bool) @ 0x159d1c70 in /usr/bin/clickhouse
10. DB::ReplicatedMergeTreeRestartingThread::runImpl() @ 0x159c988c in /usr/bin/clickhouse
11. DB::ReplicatedMergeTreeRestartingThread::run() @ 0x159c64be in /usr/bin/clickhouse
12. DB::BackgroundSchedulePoolTaskInfo::execute() @ 0x13e3e858 in /usr/bin/clickhouse
13. DB::BackgroundSchedulePool::threadFunction() @ 0x13e41136 in /usr/bin/clickhouse
14. ? @ 0x13e41e6c in /usr/bin/clickhouse
15. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0xa487348 in /usr/bin/clickhouse
16. ? @ 0xa48a55d in /usr/bin/clickhouse
17. start_thread @ 0x76db in /lib/x86_64-linux-gnu/libpthread-2.27.so
18. __clone @ 0x12161f in /lib/x86_64-linux-gnu/libc-2.27.so
 (version 22.8.5.29 (official build))

As far as I can see, the problem is that losing the ZK session triggers a partial shutdown of the table. This partial shutdown includes resetting storage.replica_is_active_node, which is an ephemeral node living in ZK. Since the ZK session is already lost, that removal fails and throws an exception which isn't caught inside ReplicatedMergeTreeRestartingThread::run(), so the task isn't re-scheduled.

I see several ways to fix it:

  • Set storage.replica_is_active_node->need_remove to false before resetting. This is tricky because it should only be done when CH loses the connection with ZK, not on every shutdown.
  • Catch exceptions when resetting storage.replica_is_active_node = nullptr and ignore them, so the shutdown can continue. We should check that the rest of the operations can actually be performed.
  • Catch all exceptions in partialShutdown so run() can continue and re-schedule a check in time (10 seconds, or 1 second now).
  • Don't call partialShutdown on ZK session loss and only set the table as read-only (without doing the rest of the cleanup).

In my opinion the second option seems the most reasonable, as it would still allow shutting down the other threads/tasks. This is good if the ZK session is lost for a long time, but it might be preferable to keep them alive if it's only for a few seconds (like a leader change). Any thoughts @tavplubix ?
