Skip to content

Conversation

@halfprice
Copy link
Contributor

@halfprice halfprice commented Jul 21, 2021

This PR cherry-picks the new single cluster gray failure detection and recovery mechanism to release 7.0 branch.

This feature is guarded by knobs and currently turned off by default.

Tests performed on this PR:

  • Key methods are tested using new unit tests added in this PR.
  • 100K joshua test: 20210812-000143-zhewu-5249-f9f9e5370f9954da
  • Real cluster test. Besides testing the functionality works properly, it also tests that turning off the knob disables the feature completely.
  • Compatibility test: making sure that running multiple versions in the same cluster won't break.

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: ed7a75d
  • Result: FAILED
  • Build Logs (available for 7 days)

@halfprice halfprice changed the base branch from master to release-6.3 July 21, 2021 23:45
@halfprice halfprice closed this Jul 21, 2021
@halfprice halfprice reopened this Jul 21, 2021
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: ed7a75d
  • Result: FAILED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: ed7a75d
  • Result: FAILED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: ed7a75d
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@halfprice halfprice force-pushed the zhewu/gray-failure-6.3 branch from ed7a75d to c4b0814 Compare July 31, 2021 06:22
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: c4b0814
  • Result: FAILED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: c4b0814
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@halfprice halfprice force-pushed the zhewu/gray-failure-6.3 branch from c4b0814 to f658d45 Compare August 11, 2021 23:27
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: f658d45
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@halfprice halfprice force-pushed the zhewu/gray-failure-6.3 branch from f658d45 to 590feae Compare August 12, 2021 00:02
@halfprice halfprice changed the title [DRAFT] Merge gray failure detection & recovery to 6.3 Cherry-pick Gray failure detection and recovery to release 6.3 Aug 12, 2021
@halfprice halfprice marked this pull request as ready for review August 12, 2021 00:04
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 590feae
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@halfprice halfprice merged commit e52b7b3 into apple:release-6.3 Aug 19, 2021
@halfprice halfprice deleted the zhewu/gray-failure-6.3 branch August 19, 2021 18:00

auto& health = workerHealth[workerAddress];

// First, remove any degraded peers recorded in the `workerHealth`, but aren't in the incoming request. These
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the format looks funny.
likely tab vs space.
can you run clang-format to the change?

// degraded since A is already considered as degraded.
std::unordered_set<NetworkAddress> currentDegradedServers;
for (const auto& [complainerCount, badServer] : count2DegradedPeer) {
for (const auto& complainer : degradedLinkDst2Src[badServer]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if a server is considered as degraded and excluded, all peers of the degraded server won't be considered as degraded.
is that a correct summary?

}
} else {
self->excludedDegradedServers.clear();
TraceEvent("DegradedServerDetectedAndSuggestRecovery");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make the event more straightforward to SRE on which server to exclude before trigger recovery?
e.g., add a field.
.detail("Hint", "Exclude based on ClusterControllerHealthMonitor event and manually trigger recovery")

init( CC_MIN_DEGRADATION_INTERVAL, 120.0 );
init( CC_DEGRADED_PEER_DEGREE_TO_EXCLUDE, 3 );
init( CC_MAX_EXCLUSION_DUE_TO_HEALTH, 2 );
init( CC_HEALTH_TRIGGER_RECOVERY, false );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would simulation pass if we buggify this knob to true?

init( MIN_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS, 10.0 );
init( MAX_DELAY_CC_WORST_FIT_CANDIDACY_SECONDS, 30.0 );
init( DBINFO_FAILED_DELAY, 1.0 );
init( ENABLE_WORKER_HEALTH_MONITOR, false );
Copy link
Contributor

@xumengpanda xumengpanda Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since SRE may toggle some of the knobs, it would be good to add if( randomize && BUGGIFY ) to test different values of those knobs.

@sfc-gh-jslocum sfc-gh-jslocum mentioned this pull request Aug 20, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants