Keeper retries during insert (clean)#42607
Conversation
Initial implementation was different and it changed the entire ReplicatedMergeTreeSink::commitPart() which change history provided by git blame. Then RetriesControl.retryLoop() was introduced later which significantly reduces the diff since it's like while() used before. So, check outing the current version will keep more original history in git blame, which is useful here
+ enable fault injection for functional tests
+ for multi-op request, it create responses in multi response + some polishing
+ throw immediately if table has no metadata in zk + stop retries in case of shutdown + check if table is readonly at the begining of every retry
|
Super nice feature. I just wanted to let you know I did some basic tests and didn't find any issues. Basically I've run 600 inserts (one every 0.1 seconds) while I applied chaos engineering (and I was the monkey restarting random ZK instances in a cluster of 3) and it worked without flaws. All inserts finished successfully, with the max delay being ~2200 ms: As expected there were a lot of errors in the background as connections where failing: |
@Algunenano Thanks. Your testing is very appreciated I just wanted to mention that the insert retries are doing the best effort job, so, if you're especially "lucky" the insert still can fail, - for example, - when a request to keeper which corresponds to committing a part to the keeper (see request) is failed and no data was written. In such case, we'll return |
|
Test failures to investigate:
@tavplubix @alesapin, I need your help here. It's not really clear to me what's going on there |
00502_custom_partitioning_replicated_zookeeper_long
+ 01346_alter_enum_partition_key_replicated_zookeeper_long
|
Integration test for keeper has failed. It's not related to this change, but worth to check. @antonio2368 would you mind to look at it? |
|
@devcrafter, you forgot to take a look at Stress Tests failures, some of them are related |
Stress test failures were caused by config for functional tests, which includes newly introduced settings. The settings are not known by previous ClickHouse versions, so the previous version failed to start The settings are reintroduced in #43117 |
|
@devcrafter |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support for keeper request retries during insert into replicated merge trees. Apart from fault tolerance, it aims to provide better user experience, - avoid returning a user an error during insert if keeper is restarted (for example, due to upgrade).
It also introduces failure injection mode for keeper requests during insert into replicated merge tree, so it's possible to inject a failure with certain probability when accessing keeper interface (see
insert_keeper_fault_injection_probability). It's also possible to reproduce a particular insert execution with failure injections by providing the same seed, seeinsert_keeper_fault_injection_seedCloses #39764
It should be merged instead of #41734. This PR contains all the changes from #41734, but commit history does less harm to git blame output (some details in comment to 5401300)