Cherry-pick Gray failure detection and recovery to release 6.3 #5249

halfprice · 2021-07-21T23:40:57Z

This PR cherry-picks the new single-cluster gray failure detection and recovery mechanism to the release 6.3 branch.

This feature is guarded by knobs and is currently turned off by default.

Tests performed on this PR:

Key methods are tested using new unit tests added in this PR.
100K joshua test: 20210812-000143-zhewu-5249-f9f9e5370f9954da
Real cluster test: besides verifying that the functionality works properly, it also verifies that turning off the knob disables the feature completely.
Compatibility test: verifies that running multiple versions in the same cluster does not break anything.

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci · 2021-07-21T23:45:03Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build-macos
Commit ID: ed7a75d
Result: FAILED
Build Logs (available for 7 days)

foundationdb-ci · 2021-07-21T23:49:42Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build-macos
Commit ID: ed7a75d
Result: FAILED
Build Logs (available for 7 days)

foundationdb-ci · 2021-07-21T23:59:08Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build
Commit ID: ed7a75d
Result: FAILED
Build Logs (available for 7 days)

foundationdb-ci · 2021-07-22T00:06:06Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build
Commit ID: ed7a75d
Result: SUCCEEDED
Build Logs (available for 7 days)

foundationdb-ci · 2021-07-31T06:54:06Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build
Commit ID: c4b0814
Result: FAILED
Build Logs (available for 7 days)

foundationdb-ci · 2021-07-31T07:21:18Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build-macos
Commit ID: c4b0814
Result: SUCCEEDED
Build Logs (available for 7 days)

foundationdb-ci · 2021-08-11T23:56:27Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build
Commit ID: f658d45
Result: SUCCEEDED
Build Logs (available for 7 days)

foundationdb-ci · 2021-08-12T00:26:06Z

AWS CodeBuild CI Report

CodeBuild project: foundationdb-pull-request-build
Commit ID: 590feae
Result: SUCCEEDED
Build Logs (available for 7 days)

xumengpanda · 2021-08-19T18:36:30Z

fdbserver/ClusterController.actor.cpp

+
+		auto& health = workerHealth[workerAddress];
+
+		// First, remove any degraded peers recorded in the `workerHealth`, but aren't in the incoming request. These


The formatting looks off, likely tabs vs. spaces.
Can you run clang-format on the change?

xumengpanda · 2021-08-19T18:56:21Z

fdbserver/ClusterController.actor.cpp

+		// degraded since A is already considered as degraded.
+		std::unordered_set<NetworkAddress> currentDegradedServers;
+		for (const auto& [complainerCount, badServer] : count2DegradedPeer) {
+			for (const auto& complainer : degradedLinkDst2Src[badServer]) {


So if a server is considered degraded and excluded, none of the peers of that degraded server will be considered degraded.
Is that a correct summary?
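To make the question concrete, here is a minimal, self-contained sketch of the selection loop as it reads from the quoted diff. It is not the PR's code: `Address` (a plain string standing in for `NetworkAddress`), `ComplaintMap`, and `pickDegradedServers` are hypothetical names, and the sort by complainer count is an assumption inferred from the `count2DegradedPeer` variable name.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for NetworkAddress.
using Address = std::string;

// Maps a reported-bad server to the set of workers complaining about it
// (the role degradedLinkDst2Src plays in the quoted diff).
using ComplaintMap = std::unordered_map<Address, std::unordered_set<Address>>;

std::unordered_set<Address> pickDegradedServers(const ComplaintMap& degradedLinkDst2Src) {
    // Assumption: candidates are visited in decreasing order of complainer count.
    std::vector<std::pair<int, Address>> count2DegradedPeer;
    for (const auto& [badServer, complainers] : degradedLinkDst2Src) {
        count2DegradedPeer.emplace_back(static_cast<int>(complainers.size()), badServer);
    }
    std::sort(count2DegradedPeer.begin(), count2DegradedPeer.end(), std::greater<>());

    std::unordered_set<Address> currentDegradedServers;
    for (const auto& [complainerCount, badServer] : count2DegradedPeer) {
        for (const auto& complainer : degradedLinkDst2Src.at(badServer)) {
            // A complaint only counts if the complainer is not itself already
            // considered degraded; one such complaint is enough.
            if (currentDegradedServers.count(complainer) == 0) {
                currentDegradedServers.insert(badServer);
                break;
            }
        }
    }
    return currentDegradedServers;
}
```

Under this reading, if B, C, and D all complain about A, and A complains about B, only A ends up in the result. B would still be marked if some worker that is not already considered degraded also complained about it, so the summary above holds only for peers whose complainers are all already degraded.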

xumengpanda · 2021-08-19T19:20:29Z

fdbserver/ClusterController.actor.cpp

+						}
+					} else {
+						self->excludedDegradedServers.clear();
+						TraceEvent("DegradedServerDetectedAndSuggestRecovery");


Can we make the event more straightforward for SREs, so it is clear which server to exclude before triggering recovery?
E.g., add a field:
.detail("Hint", "Exclude based on ClusterControllerHealthMonitor event and manually trigger recovery")

xumengpanda · 2021-08-19T19:24:25Z

fdbserver/Knobs.cpp

+	init( CC_MIN_DEGRADATION_INTERVAL,                         120.0 );
+	init( CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE,                      3 );
+	init( CC_MAX_EXCLUSION_DUE_TO_HEALTH,                          2 );
+	init( CC_HEALTH_TRIGGER_RECOVERY,                          false );


Would simulation pass if we BUGGIFY this knob to true?

xumengpanda · 2021-08-19T19:25:48Z

fdbserver/Knobs.cpp

 	init( MIN_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS,             10.0 );
 	init( MAX_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS,             30.0 );
 	init( DBINFO_FAILED_DELAY,                                   1.0 );
+	init( ENABLE_WORKER_HEALTH_MONITOR,                        false );


Since SREs may toggle some of these knobs, it would be good to add if( randomize && BUGGIFY ) to test different values of those knobs.
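A rough, self-contained sketch of what that suggestion could look like is below. It does not use the real knob machinery: `HealthKnobs`, `initialize`, and the `buggify` callback (a stand-in for the `BUGGIFY` macro) are hypothetical, and the alternative value for `CC_MIN_DEGRADATION_INTERVAL` is arbitrary. Only the knob names and defaults are taken from the diff.

```cpp
#include <functional>

// Hypothetical, simplified stand-in for a knob collection. The knob names and
// defaults come from the diff above; everything else is illustrative.
struct HealthKnobs {
    bool ENABLE_WORKER_HEALTH_MONITOR = false;
    bool CC_HEALTH_TRIGGER_RECOVERY = false;
    double CC_MIN_DEGRADATION_INTERVAL = 120.0;

    // `buggify` stands in for the BUGGIFY macro: it returns true when a
    // simulation run decides to perturb this particular knob.
    void initialize(bool randomize, const std::function<bool()>& buggify) {
        ENABLE_WORKER_HEALTH_MONITOR = false;
        if (randomize && buggify()) ENABLE_WORKER_HEALTH_MONITOR = true;

        CC_HEALTH_TRIGGER_RECOVERY = false;
        if (randomize && buggify()) CC_HEALTH_TRIGGER_RECOVERY = true;

        CC_MIN_DEGRADATION_INTERVAL = 120.0;
        // 10.0 is an arbitrary alternative chosen for illustration only.
        if (randomize && buggify()) CC_MIN_DEGRADATION_INTERVAL = 10.0;
    }
};
```

With `randomize` set to false the defaults are always kept, so production behavior is unchanged; with `randomize` set to true, each knob is perturbed independently whenever the callback fires, which is how simulation would get to exercise both the enabled and disabled paths.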

halfprice changed the base branch from master to release-6.3 July 21, 2021 23:45

halfprice closed this Jul 21, 2021

halfprice reopened this Jul 21, 2021

halfprice force-pushed the zhewu/gray-failure-6.3 branch from ed7a75d to c4b0814 Compare July 31, 2021 06:22

halfprice added 6 commits August 11, 2021 16:27

Create health monitor in FDB workers to monitor network condition. This change is only inside the worker.

46db5f9

Addressing comments.

ad46f4b

Add updateWorkerHealth interface in cluster controller

335443b

Fix endpoint ordering by moving the new updateWorkerHealth to the end of ClusterControllerFullInterface

6f7dffe

Implement the core logic of gray network triggered recovery in cluster controller

405495e

Address conflict and adjust to 6.3 code

410daaa

halfprice force-pushed the zhewu/gray-failure-6.3 branch from c4b0814 to f658d45 Compare August 11, 2021 23:27

Bug fixes and address compatibility

590feae

halfprice force-pushed the zhewu/gray-failure-6.3 branch from f658d45 to 590feae Compare August 12, 2021 00:02

halfprice changed the title from "[DRAFT] Merge gray failure detection & recovery to 6.3" to "Cherry-pick Gray failure detection and recovery to release 6.3" Aug 12, 2021

halfprice requested review from RenxuanW, jzhou77 and xumengpanda August 12, 2021 00:04

halfprice marked this pull request as ready for review August 12, 2021 00:04

jzhou77 assigned xumengpanda Aug 16, 2021

jzhou77 approved these changes Aug 16, 2021

RenxuanW approved these changes Aug 19, 2021

halfprice merged commit e52b7b3 into apple:release-6.3 Aug 19, 2021

halfprice deleted the zhewu/gray-failure-6.3 branch August 19, 2021 18:00

xumengpanda reviewed Aug 19, 2021

sfc-gh-jslocum mentioned this pull request Aug 20, 2021

Fixing Trace Event Names #5431

Merged

5 tasks

