Prevent coordinated state from filling up with too many old generations #2814
Conversation
delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state

closes #2796
ajbeamon left a comment
@apkar It's probably worth you thinking about this too.
The effect of this is that after 70 recoveries that reach the accepting commits stage without fully recovering, subsequent recoveries are delayed by an increasing amount equal to 1 minute per recovery over 70. Once it reaches 100 recoveries (about 7 hours after the 70th one), the cluster stops trying to recover at all.
In either case, you can force a recovery to happen immediately by setting the knob --knob_max_generations_override to a value larger than the current number of old generations. If the number of recoveries then exceeds the override while the cluster is still trying to recover, you go back to the underlying behavior based on your number of recoveries, which could immediately stop the recovery process again. In other words, you should set a value that leaves margin for a few more recoveries if a full recovery is expected to take a while.
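A minimal sketch of that delay/stop/override logic, assuming the 70- and 100-generation limits and the 1-minute step quoted above; every name here is hypothetical, not the actual implementation:

```cpp
// Hypothetical sketch of the recovery backoff described in this PR.
#include <optional>

constexpr int DELAY_AFTER_GENERATIONS = 70;         // start delaying recoveries here
constexpr int STOP_AFTER_GENERATIONS = 100;         // stop recovering entirely here
constexpr double SECONDS_PER_EXTRA_GENERATION = 60; // 1 minute per generation over 70

// Returns the delay before the next recovery attempt, or std::nullopt if
// recovery should not be attempted at all.
std::optional<double> recoveryDelay(int oldGenerations, int maxGenerationsOverride) {
	// Setting the override knob above the current generation count forces an
	// immediate recovery attempt.
	if (maxGenerationsOverride > oldGenerations)
		return 0.0;
	// Once the generation count reaches the override again, the underlying
	// limits below take effect once more.
	if (oldGenerations >= STOP_AFTER_GENERATIONS)
		return std::nullopt; // too many old generations: stop recovering
	if (oldGenerations > DELAY_AFTER_GENERATIONS)
		return (oldGenerations - DELAY_AFTER_GENERATIONS) * SECONDS_PER_EXTRA_GENERATION;
	return 0.0;
}
```

For example, under this sketch a cluster with 85 old generations and no override would delay the next attempt by 15 minutes, while an override of 90 would recover immediately but leave margin for only a few more recoveries before the underlying limits apply again.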
Being in one of these states can be detected by the fact that a SevError event is logged in both the delayed and the stopped case. These events show up in status when it reports process errors, and the details of the event can be found in the trace logs, including a message about which knob to set and the number of generations that needs to be exceeded.
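For illustration, a hypothetical sketch of the kind of SevError trace event described above, using FoundationDB's TraceEvent API from flow/Trace.h; the actual event names and fields in this PR may differ:

```cpp
#include "flow/Trace.h"

void reportTooManyOldGenerations(int oldGenerations, bool stopped) {
	// SevError events are surfaced by status as process errors, so operators
	// can notice this state without reading trace files directly.
	// Event names here are invented for illustration only.
	TraceEvent(SevError,
	           stopped ? "RecoveryStoppedTooManyGenerations" : "RecoveryDelayedTooManyGenerations")
	    .detail("OldGenerations", oldGenerations)
	    .detail("OverrideKnob", "max_generations_override"); // knob named in this PR
}
```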
There are some cases where a cluster can build up old generations for a while, for example a multi-region configuration where a DC is down for an extended period. While in this state, this change restricts how many recovery-causing actions you can take on the cluster. Avoiding the recoveries is good anyway, as they come with a performance cost, at least for now.
If the number of old generations is going to be an important metric, we should consider publishing it in status so that it can be easily monitored. If possible, we should make this value observable even in the face of a cluster that's not very healthy.
I don't know much about how recoveries can fail; is it possible we can have recoveries failing repeatedly even before reaching …
We're only counting the number of old generations of logs, which is what makes these expensive. Recoveries that don't create them essentially don't count toward this limit.
Just to reiterate what you said: in 7.0, status would have a "number of old generations of logs" field, which would represent the recovery threshold we care about. By the way, this explanation is very useful; it explains why this knob is useful. Can we have this in the documentation, at least as a comment in …
I think the status should have this field in this PR. There are other changes we can make in 7.0, including making this state a separate recovery phase that can be viewed in the status document and handled specially in fdbcli, etc.
Co-Authored-By: A.J. Beamon <[email protected]>