
Conversation

@etschannen
Contributor

delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state

@etschannen
Contributor Author

closes #2796

@ajbeamon
Contributor

ajbeamon left a comment

@apkar It's probably worth you thinking about this too.

The effects of this are that after 70 recoveries that reach the accepting commits stage without fully recovering, recoveries will be delayed an increasing amount equal to 1 minute per recovery over 70. Once it reaches 100 recoveries (about 7 hours after the 70th one), it stops trying to recover at all.
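To make the policy concrete, here is a minimal sketch in C++; the names, constants, and structure are illustrative only, not the actual identifiers used in this PR:

```cpp
#include <optional>

// Thresholds as described above; in the real code these live in knobs.
constexpr int    DELAY_THRESHOLD = 70;        // old generations before recoveries are delayed
constexpr int    STOP_THRESHOLD  = 100;       // old generations before recoveries stop entirely
constexpr double DELAY_PER_GENERATION = 60.0; // one minute of extra delay per generation over 70

// Returns the delay in seconds before the next recovery attempt,
// or std::nullopt if recovery should not be attempted at all.
std::optional<double> recoveryDelay(int oldGenerations, int maxGenerationsOverride) {
    // An operator-set override larger than the current number of old
    // generations lets the next recovery proceed immediately.
    if (oldGenerations < maxGenerationsOverride)
        return 0.0;
    if (oldGenerations >= STOP_THRESHOLD)
        return std::nullopt; // stop trying to recover
    if (oldGenerations > DELAY_THRESHOLD)
        return (oldGenerations - DELAY_THRESHOLD) * DELAY_PER_GENERATION;
    return 0.0; // normal recovery, no extra delay
}
```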

In either case, you can get a recovery to happen immediately by setting the knob --knob_max_generations_override to a value larger than the current number of old generations. If the number of old generations exceeds the value set by this knob while you are still trying to recover, you fall back to the underlying behavior based on the number of old generations, which could immediately stop the recovery process again. In other words, set a value that leaves margin for a few more generations if a full recovery is expected to take a while.
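For example (values illustrative), the knob can be passed on the fdbserver command line as --knob_max_generations_override=120 or, assuming the usual knob_ passthrough in foundationdb.conf, set like this:

```
[fdbserver]
# Illustrative value: must comfortably exceed the current number of old
# generations (e.g. 120 when roughly 85 generations are outstanding).
knob_max_generations_override = 120
```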

Being in one of these states can be detected by the fact that a SevError event is logged in both the delayed and the stopped case. These events show up in status when it reports process errors, and the details can be found in the trace logs, which include a message about which knob to set and the number of generations that needs to be exceeded.
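One way to surface those process errors from the command line (a sketch, assuming the usual status json layout and that jq is available):

```sh
# List any processes that are reporting messages/errors in status json.
fdbcli --exec 'status json' \
  | jq '.cluster.processes[] | select((.messages | length) > 0) | {address, messages}'
```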

There are some cases where a cluster can build up old generations for a while, for example a multi-region configuration in which a DC is down for an extended period. Being in this state restricts how many actions you can take on that cluster that would cause a recovery. Avoiding those recoveries is good anyway, as at least for now they come with a performance cost.

@ajbeamon
Contributor

If the number of old generations is going to be an important metric, we should consider publishing it in status so that it can be easily monitored. If possible, we should make this value observable even in the face of a cluster that's not very healthy.

@apkar
Contributor

apkar commented Mar 13, 2020

The effects of this are that after 70 recoveries that reach the accepting commits stage without fully recovering, recoveries will be delayed an increasing amount equal to 1 minute per recovery over 70. Once it reaches 100 recoveries (about 7 hours after the 70th one), it stops trying to recover at all.

I don't know much about how recoveries can fail. Is it possible to have recoveries failing repeatedly even before they reach the accepting commits stage? From your explanation, it sounds like only the recoveries that have reached the accepting commits stage are counted.

@ajbeamon
Contributor

We're only counting the number of old generations of logs, which is what makes these expensive. Recoveries that don't create them essentially don't count toward this limit.

@apkar
Contributor

apkar commented Mar 13, 2020

Just to reiterate what you said: in 7.0, status would have the "number of old generations of logs", which would represent the recovery threshold we care about.

By the way, this explanation is very useful and explains why the knob is needed. Can we have this in the documentation? Even a comment in knobs.cpp|.h would be super helpful for reference.

@ajbeamon
Contributor

I think the status should have this field in this PR. There are other changes we can make in 7.0, including making this state a separate recovery phase that can be viewed in the status document and handled specially in fdbcli, etc.
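If the count does get published, monitoring could key off a status json field along these lines; the field name and placement below are an assumption, so check the actual status schema for the release you are running:

```json
{
  "cluster": {
    "generation": 73
  }
}
```

An alerting rule on such a field could fire well before the delay threshold of 70 is reached.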

@ajbeamon ajbeamon merged commit f2defc3 into apple:release-6.2 Mar 16, 2020