Skip to content

Conversation

@sbodagala
Copy link
Contributor

This is in the context of HA-related version vector optimization: propagate the locality of the primary region from ClusterController to the sequencer, rather than deriving it from "MasterInterface" on the sequencer (which is not correct).

Testing:
Simulation tests:
Compiler: gcc.
Version vector disabled: 20220706-102147-sre-f3f7f09b7eac8fd9 (no failures).
Version vector enabled, knob ENABLE_VERSION_VECTOR_HA_OPTIMIZATION enabled: 20220706-103407-sre-180fb8af27a15e6d (no failures).

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

to the sequencer, rather than deriving it from "MasterInterface" on
the sequencer (which is not correct).
@sbodagala sbodagala requested review from dlambrig and jzhou77 July 6, 2022 12:07
@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: a84112d
  • Duration 0:40:14
  • Result: ❌ FAILED
  • Error: Error while executing command: tar -czf "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION}/fdb_${FDB_VERSION}.tar.gz" --directory "${CODEBUILD_SRC_DIR}/packaging/docker/website/${FDB_VERSION} .. Reason: exit status 2
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: a84112d
  • Duration 0:56:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

Copy link
Contributor

@dlambrig dlambrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@jzhou77 jzhou77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Make sure to test this in a HA setup before merging.

@sbodagala
Copy link
Contributor Author

sbodagala commented Jul 6, 2022

LGTM

Make sure to test this in a HA setup before merging.

I haven't run a Kubernetes test recently (and never explicitly ran a HA test), so it may take a while for me to do this. Is there an alternative way of verifying this? Or, maybe I can file another follow up radar for the HA testing part (if that is the only way of testing this)?

@jzhou77
Copy link
Contributor

jzhou77 commented Jul 6, 2022

LGTM
Make sure to test this in a HA setup before merging.

I haven't run a Kubernetes test recently (and never explicitly ran a HA test), so it may take a while for me to do this. Is there an alternative way of verifying this? Or, maybe I can file another follow up radar for the HA testing part (if that is the only way of testing this)?

It's probably easier to modify the simulator to use non-numeric DC IDs and reproduce the problem and verify the fix.

Ok, I will try that. Thanks!

@sbodagala
Copy link
Contributor Author

sbodagala commented Jul 7, 2022

It's probably easier to modify the simulator to use non-numeric DC IDs and reproduce the problem and verify the fix.

I applied the following diff:

> @@ -2124,7 +2126,7 @@ void setupSimulatedSystem(std::vector<Future<Void>>* systemActors,
>                                                                           ProcessClass::StatelessClass };
>         for (int dc = 0; dc < dataCenters; dc++) {
>                 // FIXME: test unset dcID
> -               Optional<Standalone<StringRef>> dcUID = StringRef(format("%d", dc));
> +               Optional<Standalone<StringRef>> dcUID = StringRef(format("data%d", dc));
>                 std::vector<UID> machineIdentities;
>                 int machines = machineCount / dataCenters +
>                                (dc < machineCount % dataCenters); // add remainder of machines to first datacenter
> 

The following test causes the sequencer to crash (trying to do "std::stoi()"), on "main", over the above diff:

build_output/bin/fdbserver -r simulation --crash -f /root/src/foundationdb/tests/fast/WriteDuringReadClean.toml -b off -s 22570166

The test succeeds over the changes in this PR. The primary locality id on the sequencer is getting set to "-1" (in this test).

@sbodagala sbodagala merged commit fa0c324 into apple:main Jul 7, 2022
sbodagala added a commit to sbodagala/foundationdb that referenced this pull request Jul 7, 2022
Make sequencer aware of the locality of the primary region
@sbodagala sbodagala mentioned this pull request Jul 7, 2022
5 tasks
@jzhou77
Copy link
Contributor

jzhou77 commented Jul 7, 2022

The test succeeds over the changes in this PR. The primary locality id on the sequencer is getting set to "-1" (in this test).

Can you create a PR with your test changes? I think it's worth making the change to catch any future regression.

@sbodagala
Copy link
Contributor Author

The test succeeds over the changes in this PR. The primary locality id on the sequencer is getting set to "-1" (in this test).

Can you create a PR with your test changes? I think it's worth making the change to catch any future regression.

I am not sure we can check in this change. Check the top comment in "SimulationConfig::setRegions()" - maybe there are similar assumptions at other places too.

jzhou77 pushed a commit that referenced this pull request Jul 8, 2022
Make sequencer aware of the locality of the primary region
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants