@kakaiu kakaiu commented Aug 11, 2021

In some long recoveries, the recruiting_transaction_servers step takes an abnormally long time.
We want to understand why this step takes so long to finish.
A starting point for root cause analysis is to determine whether (1) the CC repeatedly retries the recruitment; (2) the recruitment is delayed to the backend; or (3) some CC error occurs and the CC dies.
For cases (2) and (3), we already have the RecruitFromConfigurationNotAvailable and RecruitFromConfigurationError events to identify the case.
For case (1), this PR adds the RecruitFromConfigurationRetry event.
The new event should appear rarely, since (1) it is emitted only when there are not enough available servers for recruitment and (2) the retry (not in the backend) lasts at most 1 second (set by WAIT_FOR_GOOD_RECRUITMENT_DELAY).
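
To make the three cases concrete, here is a minimal standalone model in plain C++ (not Flow, and not the actual cluster controller code). The branch structure is inferred from the description above and the diff context; `Attempt`, `Outcome`, `simulateRecruitment`, and the `attemptDelay` parameter are hypothetical names introduced only for illustration, and time is simulated rather than read from a clock.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result of one recruitment attempt (not an FDB type).
enum class Attempt { Success, NotEnoughServers, FatalError };

struct Outcome {
    std::vector<std::string> events; // trace event names, in emission order
    double elapsed = 0.0;            // simulated seconds spent retrying
};

// Models one recruitment request. While the "good recruitment" window is
// still open, a failed attempt is retried in place (case 1). Once the window
// has elapsed, the request is handed off instead (case 2). Any other error
// ends the loop immediately (case 3).
Outcome simulateRecruitment(const std::function<Attempt(int)>& tryRecruit,
                            double goodRecruitmentDelay,
                            double attemptDelay) {
    if (attemptDelay <= 0.0) {
        throw std::invalid_argument("attemptDelay must be positive");
    }
    Outcome out;
    double now = 0.0;
    for (int attempt = 0;; ++attempt) {
        const Attempt result = tryRecruit(attempt);
        if (result == Attempt::Success) {
            break;
        }
        if (result == Attempt::FatalError) {
            out.events.push_back("RecruitFromConfigurationError");
            break;
        }
        if (now >= goodRecruitmentDelay) {
            // Window closed: stop retrying here and defer the request.
            out.events.push_back("RecruitFromConfigurationNotAvailable");
            break;
        }
        // Window still open: record the retry and wait before trying again.
        out.events.push_back("RecruitFromConfigurationRetry");
        now += attemptDelay;
    }
    out.elapsed = now;
    return out;
}
```

With a 1 second window and a 0.25 second pause between attempts, a recruitment that never finds enough servers produces four retry events followed by one not-available event in this model, which is why the retry event is expected to be short-lived.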

Passed Joshua correctness-7.1.0 test: 20210824-013255-zhewang-1f4960baeba2f8f6

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@kakaiu kakaiu requested a review from halfprice August 11, 2021 19:25
@foundationdb-ci

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 2bb1ea8
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

        return Void();
    } else if (e.code() == error_code_operation_failed || e.code() == error_code_no_more_servers) {
        // recruitment not good enough, try again
        TraceEvent("RecruitFromConfigurationRetry", self->id).error(e);

Let's also record in the TraceEvent whether goodRecruitmentTime is ready.


And also how long until it is ready.


What I meant was to add

.detail("GoodRecruitmentTimeReady", goodRecruitmentTime.isReady())

Regarding "And also how long until it is ready.": I was wondering whether we could print how long until goodRecruitmentTime is ready (the amount of time left in the delay), but looking at the Future class, I don't see a way to get this information, so you can ignore it for now.
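
For reference, one way the remaining time could be made available is to remember the deadline next to the delay when it is created. The sketch below is plain standalone C++ under that assumption; `TrackedDelay` and its members are hypothetical and are not part of Flow or its Future class.

```cpp
#include <algorithm>

// Hypothetical helper (not a Flow type): stores the absolute deadline of a
// delay so that both readiness and the time left can be reported.
struct TrackedDelay {
    double deadline; // absolute time, in seconds, at which the delay completes

    TrackedDelay(double startTime, double duration)
      : deadline(startTime + duration) {}

    // True once the current time has reached the deadline.
    bool isReady(double now) const { return now >= deadline; }

    // Seconds left until the deadline, clamped at zero once it has passed.
    double remaining(double now) const { return std::max(0.0, deadline - now); }
};
```

A trace event could then carry both values as details, for example the result of `isReady(now)` and of `remaining(now)`, without needing anything from the future itself.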

@foundationdb-ci

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: ef0a20c
  • Result: FAILED
  • Build Logs (available for 7 days)

@foundationdb-ci

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: ef0a20c
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

try {
    auto rep = self->findWorkersForConfiguration(req);
    req.reply.send(rep);
    TraceEvent("RecruitFromConfigurationDone", self->id).detail("WaitTime", now() - startTime);

You can remove this. See my other comment.

@kakaiu kakaiu requested a review from halfprice August 24, 2021 01:28
@foundationdb-ci

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: 7f595f4
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@foundationdb-ci

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 7f595f4
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@kakaiu kakaiu marked this pull request as ready for review August 24, 2021 02:14
@kakaiu kakaiu requested a review from RenxuanW August 24, 2021 02:14
4 participants