Skip to content

Conversation

@dlambrig
Copy link
Contributor

@dlambrig dlambrig commented Jul 1, 2022

Cherry pick PR 7462

When there are no user commits, versions advance at a minimum rate governed by knob MAX_COMMIT_BATCH_INTERVAL. FDB needs continuously generated commits, e.g. in shard movement where new versions are "piggy backed" to make progress.

The actors commitBatcher and commitProxyServerCore generate these commits. The problem is they run asynchronously to each other and can get out of sync. The former times out every MAX_COMMIT_BATCH_INTERVAL seconds and wakes the later, which may have sent an empty message less than MAX_COMMIT_BATCH_INTERVAL seconds ago (it may have delayed the previous commit). In that case, commitProxyServerCore won't send the empty commit, and a new commit won't be generated until commitBatcher times out again.

This pattern delays version advancement up to MAX_COMMIT_BATCH_INTERVAL*2 seconds and breaks the knob's contract. Shard movement is slowed, eventually leading to the failure in quiet database.

This PR makes a single actor (commitBatcher) regulate the frequency of empty commits.

Fixes
MutationLogReaderCorrectness.toml -b on -s 653691063;
MutationLogReaderCorrectness.toml -b on -s 98766466
etc. (more).

Joshua
20220701-220644-henrylambright-f99f08bd5fb313ef

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@dlambrig dlambrig requested review from jzhou77 and sbodagala July 1, 2022 22:09
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 9176667
  • Duration 0:37:25
  • Result: ❌ FAILED
  • Error: Error while executing command: tar -czf "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION}/fdb_${FDB_VERSION}.tar.gz" --directory "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION} .. Reason: exit status 2
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 9176667
  • Duration 0:56:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS BigSur 11.5.2

  • Commit ID: 2ae4f8d
  • Duration 0:33:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 2ae4f8d
  • Duration 0:37:00
  • Result: ❌ FAILED
  • Error: Error while executing command: tar -czf "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION}/fdb_${FDB_VERSION}.tar.gz" --directory "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION} .. Reason: exit status 2
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 2ae4f8d
  • Duration 0:58:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@dlambrig dlambrig merged commit 8a699d0 into release-7.1 Jul 5, 2022
@dlambrig dlambrig deleted the cherry7462 branch July 5, 2022 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants