Skip to content

Keeper retries during insert#41734

Closed
devcrafter wants to merge 109 commits intomasterfrom
igor/insert_zk_retries
Closed

Keeper retries during insert#41734
devcrafter wants to merge 109 commits intomasterfrom
igor/insert_zk_retries

Conversation

@devcrafter
Copy link
Copy Markdown
Member

@devcrafter devcrafter commented Sep 23, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support for keeper request retries during insert into replicated merge trees. Apart from fault tolerance, it aims to provide better user experience, - avoid returning a user an error during insert if keeper is restarted (for example, due to upgrade).

It also introduces failure injection mode for keeper requests during insert into replicated merge tree, so it's possible to inject a failure with certain probability when accessing keeper interface (see insert_keeper_fault_injection_probability). It's also possible to reproduce a particular insert execution with failure injections by providing the same seed, see insert_keeper_fault_injection_seed

Closes #39764

@devcrafter devcrafter marked this pull request as draft September 23, 2022 13:53
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Sep 23, 2022
@devcrafter devcrafter changed the title ZK retries during insert Keeper retries during insert Sep 23, 2022
@tavplubix tavplubix self-assigned this Sep 23, 2022
+ do not count try if there is no error
@qoega
Copy link
Copy Markdown
Member

qoega commented Sep 23, 2022

Let's test it doing rolling restart of cluster of keepers in integration test as well? We should not fail inserts in this case?

@alexey-milovidov
Copy link
Copy Markdown
Member

Changelog entry is required.

@devcrafter devcrafter added the do not test disable testing on pull request label Sep 28, 2022
@Algunenano
Copy link
Copy Markdown
Member

Some notes about usability:

  • IMO It'd be much more comfortable to have a setting for the max wait time for ZK. The current proposal with 3 (insert_keeper_max_retries, insert_keeper_retry_initial_backoff_ms and insert_keeper_retry_max_backoff_ms) forces you to make calculations . This is important because max_execution_time isn't being considered in the sleep/wait, so it might be that you wait 40 seconds for ZK to recover just to get cancelled a moment later by having reached the query timeout (30 seconds ago).

  • A nice addition to the previous for visibility would be to add a profile event to see how much time was spent waiting for keeper. Right now you can see there were issues ZooKeeperTransactions':3,'ZooKeeperGet':1,'ZooKeeperMulti':2,'ZooKeeperHardwareExceptions':10 but it's hard to know how impactful that was.

In any case, this new feature is great and helps to reduce complexity on the user side.

try*() methods return error code but in case of connectivity errors
throw
this can be improved since different try*() method can returns error
(not throws) on different set of error, i.e. some can throw on
non-connectivity errors as well (tryGet() for example)
@devcrafter
Copy link
Copy Markdown
Member Author

devcrafter commented Oct 23, 2022

devcrafter added a commit that referenced this pull request Oct 24, 2022
Initial implementation was different and it changed the entire
ReplicatedMergeTreeSink::commitPart() which change history provided by git blame.
Then RetriesControl.retryLoop() was introduced later which significantly reduces
the diff since it's like while() used before.
So, check outing the current version will keep more original history in
git blame, which is useful here
@devcrafter
Copy link
Copy Markdown
Member Author

Further changes will be done (here, please check description for the reason behind)

@devcrafter devcrafter closed this Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retries on ZooKeeper errors inside INSERT

6 participants