Cherry-pick Gray failure detection and recovery to release 6.3 #5249
Conversation
```cpp
auto& health = workerHealth[workerAddress];
// ...
// First, remove any degraded peers recorded in the `workerHealth`, but aren't in the incoming request. These
```
The formatting looks funny, likely a tab vs. space issue. Can you run clang-format on the change?
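For reference, such a pass is typically `clang-format -i <changed files>`, which picks up the repository's checked-in `.clang-format` settings; the exact file list depends on the diff.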
```cpp
// degraded since A is already considered as degraded.
std::unordered_set<NetworkAddress> currentDegradedServers;
for (const auto& [complainerCount, badServer] : count2DegradedPeer) {
    for (const auto& complainer : degradedLinkDst2Src[badServer]) {
```
So if a server is considered degraded and excluded, none of that server's peers will be considered degraded. Is that a correct summary?
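To make the question concrete, below is a minimal standalone sketch of that selection pass. It uses `std::string` in place of `NetworkAddress`, and the addresses and complaint counts are invented for illustration; it is not the PR's code verbatim.

```cpp
// Sketch of the degraded-server selection discussed above: visit servers from
// most-complained-about to least, and mark a server degraded only if at least
// one of its complainers has not itself already been marked degraded.
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    // Complained-about server -> servers that reported a degraded link to it.
    std::unordered_map<std::string, std::unordered_set<std::string>> degradedLinkDst2Src = {
        { "A", { "B", "C", "D" } }, // three independent complainers
        { "B", { "A" } }            // B's only complainer is A
    };

    // (complainer count, server), iterated in descending count order.
    std::multimap<int, std::string, std::greater<int>> count2DegradedPeer = {
        { 3, "A" }, { 1, "B" }
    };

    std::unordered_set<std::string> currentDegradedServers;
    for (const auto& [count, badServer] : count2DegradedPeer) {
        for (const auto& complainer : degradedLinkDst2Src[badServer]) {
            // A complaint only counts if the complainer is not already degraded.
            if (currentDegradedServers.find(complainer) == currentDegradedServers.end()) {
                currentDegradedServers.insert(badServer);
                break;
            }
        }
    }

    for (const auto& s : currentDegradedServers)
        std::cout << s << " is degraded\n"; // prints only: A is degraded
}
```

In this toy run only A ends up degraded: by the time B is considered, its sole complainer A is already in `currentDegradedServers`, so B is skipped. That suggests the summary holds when the excluded server is a peer's only complainer, while a peer also complained about by other, healthy servers would still be marked.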
```cpp
    }
} else {
    self->excludedDegradedServers.clear();
    TraceEvent("DegradedServerDetectedAndSuggestRecovery");
```
Can we make the event more straightforward for SREs about which server to exclude before triggering recovery? E.g., add a field:
.detail("Hint", "Exclude based on ClusterControllerHealthMonitor event and manually trigger recovery")
```cpp
init( CC_MIN_DEGRADATION_INTERVAL, 120.0 );
init( CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE, 3 );
init( CC_MAX_EXCLUSION_DUE_TO_HEALTH, 2 );
init( CC_HEALTH_TRIGGER_RECOVERY, false );
```
Would simulation pass if we buggify this knob to true (see the sketch after the next comment)?
```cpp
init( MIN_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS, 10.0 );
init( MAX_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS, 30.0 );
init( DBINFO_FAILED_DELAY, 1.0 );
init( ENABLE_WORKER_HEALTH_MONITOR, false );
```
Since SREs may toggle some of these knobs, it would be good to add `if( randomize && BUGGIFY )` to test different values of them.
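A sketch of the suggested change, following the `if( randomize && BUGGIFY )` idiom used elsewhere in ServerKnobs.cpp; which of the new knobs to randomize, and to what values, is the author's call:

```cpp
// Sketch: randomize the new health-monitor knobs under BUGGIFY so simulation
// also exercises the detection and recovery code paths.
init( ENABLE_WORKER_HEALTH_MONITOR, false ); if( randomize && BUGGIFY ) ENABLE_WORKER_HEALTH_MONITOR = true;
init( CC_HEALTH_TRIGGER_RECOVERY, false ); if( randomize && BUGGIFY ) CC_HEALTH_TRIGGER_RECOVERY = true;
```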
This PR cherry-picks the new single-cluster gray failure detection and recovery mechanism to the release 6.3 branch.
The feature is guarded by knobs and is currently turned off by default.
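Operationally, enabling it should amount to overriding those knobs at process startup, e.g. something like `--knob_enable_worker_health_monitor=true` (plus `--knob_cc_health_trigger_recovery=true` to let the cluster controller actually trigger recovery); these spellings assume fdbserver's standard `--knob_<name>` override mechanism rather than anything introduced by this PR.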
Tests performed on this PR:
Code-Reviewer Section
The general guidelines can be found here.
Please verify each of the following items and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
… (`release-branch` or `master` if this is the youngest branch)