Skip to content

Conversation

@halfprice
Copy link
Contributor

@halfprice halfprice commented Mar 31, 2023

We recently added a test to test exclude failed command which exposed this bug in the existing code.

When a client is issuing an ExclusionSafetyCheckRequest to check safe exclusion, if during this time, there is a proxy change, this will cancel the current basicLoadBalance() function along with the future created for the ExclusionSafetyCheckReply. However, when we re-issue the ExclusionSafetyCheckRequest again, we will use the same req object, whose reply future has already being cancelled. This causes the request to throw broken promise immediately, and then the host will mistakenly think the remote endpoint is failed.

This leads to the ExclusionSafetyCheckRequest to be blocked forever.

The fix is to re-create the ExclusionSafetyCheckRequest every time when issuing a new ExclusionSafetyCheckRequest.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@halfprice halfprice requested a review from jzhou77 March 31, 2023 23:32
Copy link
Contributor

@jzhou77 jzhou77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to fix is to call resetReply()

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: d903d0c
  • Duration 0:15:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: d903d0c
  • Duration 0:28:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: d903d0c
  • Duration 0:32:49
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@jzhou77
Copy link
Contributor

jzhou77 commented Apr 1, 2023

I think the intention to use state was to create the request only once. So an alternative fix is to call resetReply(req) after the choose() block. I wish flow compiler is smart enough to catch such bugs.

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: d903d0c
  • Duration 1:00:12
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@halfprice
Copy link
Contributor Author

I think the intention to use state was to create the request only once. So an alternative fix is to call resetReply(req) after the choose() block. I wish flow compiler is smart enough to catch such bugs.

That works too. All the other function using basicLoadBalance create the request while calling it. So maybe just be consistent with the rest of the code?

@jzhou77 jzhou77 merged commit b18da9e into apple:main Apr 1, 2023
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: d903d0c
  • Duration 2:01:27
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants