Skip to content

Conversation

@jzhou77
Copy link
Contributor

@jzhou77 jzhou77 commented Aug 25, 2022

Cherrypick #7994

The localities are stored in ServerDBInfo for calculating distances to other
processes. The localities are not set when creating ServerDBInfo, thus any
distances calculated before UpdateServerDBInfoRequest will be wrong.

This PR fixes this issue, thus preventing unnecessary cross DC calls,
especially for index prefetching on the storage servers.

Also added more traces for load balancing.

100k 20220825-215215-jzhou-786b84e5fa1170a5 passed.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

The localities are stored in ServerDBInfo for calculating distances to other
processes. The localities are not set when creating ServerDBInfo, thus any
distances calculated before UpdateServerDBInfoRequest will be wrong.

This PR fixes this issue, thus preventing unnecessary cross DC calls,
especially for index prefetching on the storage servers.
@jzhou77 jzhou77 marked this pull request as ready for review August 25, 2022 22:07
@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

flowguru
flowguru previously approved these changes Aug 25, 2022
@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: f99e20b
  • Duration 0:39:12
  • Result: ❌ FAILED
  • Error: Error while executing command: python3 -m joshua.joshua start --tarball $(find build_output/packages -name correctness\*.tar.gz) --username codebuild-${COMMIT_SHA} --max-runs 10000. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

@jzhou77
Copy link
Contributor Author

jzhou77 commented Aug 26, 2022

CI failure is unrelated to this change:

python3 -m joshua.joshua start --tarball $(find build_output/packages -name correctness\*.tar.gz) --username codebuild-${COMMIT_SHA} --max-runs 10000
Traceback (most recent call last):
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/joshua/joshua.py", line 32, in <module>
    from . import joshua_model
  File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/joshua/joshua_model.py", line 35, in <module>
    import boto3
ModuleNotFoundError: No module named 'boto3'

@jzhou77 jzhou77 closed this Aug 26, 2022
@jzhou77 jzhou77 reopened this Aug 26, 2022
@jzhou77 jzhou77 requested a review from flowguru August 26, 2022 03:54
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS BigSur 11.5.2

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: f99e20b
  • Duration 0:38:40
  • Result: ❌ FAILED
  • Error: Error while executing command: python3 -m joshua.joshua start --tarball $(find build_output/packages -name correctness\*.tar.gz) --username codebuild-${COMMIT_SHA} --max-runs 10000. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

break;
} else if (badServers == alternatives->countBest() && i == badServers) {
TraceEvent("AllLocalAlternativesFailed")
.suppressFor(1.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do 0.1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be quite noisy, as we observed in tests. For prod clusters that are larger, I worry too many such events even if we suppress with 0.1s. So I choose 1s, i.e., at most 3600 per server hour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change it in main branch as well? Now we have diverging PR, it is not a strictly cherry pick anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Created #8007.

@jzhou77 jzhou77 closed this Aug 26, 2022
@jzhou77 jzhou77 reopened this Aug 26, 2022
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

@jzhou77 jzhou77 merged commit 5685f03 into apple:release-7.1 Aug 26, 2022
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: f99e20b
  • Duration 3:01:42
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants