-
Notifications
You must be signed in to change notification settings - Fork 1.5k
The sharded rocks implementation skipped on TSS handling. Backfill. #9759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
saintstack
commented
Mar 21, 2023
- fdbserver/MoveKeys.actor.cpp Make startMoveShards work like startMoveKeys handling TSS. Make finishMoveShards work like finishMoveKeys handling TSS.
|
On Testing:
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Doxense CI Report for Windows 10
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
|
More testing. Added below diff when running 2.5M simulation (suggested by @liquid-helium ): Most tests run w/ sharded rocks with above change. Lots of failures -- 789 -- but an overview doesn't find any fails because of TSS'ing. |
liquid-helium
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing this!
fdbserver/MoveKeys.actor.cpp
Outdated
|
|
||
| state Transaction tr(occ); | ||
| // RYW to optimize re-reading the same key ranges. See readTSSMappingRYW | ||
| state Reference<ReadYourWritesTransaction> tr = makeReference<ReadYourWritesTransaction>(occ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to use ReadYourWritesTransaction here? Transaction was used intentionally since ReadYourWritesTransaction is not compatible with unassignServerKeys(), which is called within startMoveShards(), currently it is not actually called since the condition is always false, but it could be triggered in the future.
Sorry I should have added a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was modeling what startMoveKeys does. The call to readTSSMappingRYW wants a RYWTransaction. Let me go back to plain Transaction (and it looks like readTSSMappingRYW internally doesn't actually need RYWT....).
fdbserver/MoveKeys.actor.cpp
Outdated
| wait(checkMoveKeysLock(&(tr->getTransaction()), lock, ddEnabledState)); | ||
|
|
||
| Optional<Value> val = wait(tr.get(dataMoveKeyFor(dataMoveId))); | ||
| if (!loadedTssMapping) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move this readTSSMappingRYW() block to line 1335, after we evaluate the DataMoveMetaData since we could exit early and end up not using the loaded TssMapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one.
| readyServers.push_back(storageServerInterfaces[s].uniqueID); | ||
| } | ||
| } | ||
| int tssCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we checking if tssCount is the same as the intended tssCount?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again modeling what is done for the move of keys, it does a tssCount just to print. We are just doing same here.
|
Address @liquid-helium feedback. |
|
Rebased and addressed @liquid-helium comments. Reran simulation first with default 100k test count. Two failures. Unrelated. Next ran 1M simulations w/ below patch in place to favor sharded rocks and tss: 30 odd failures... Here is the list sorted and uniq'd: The TestFile="tests/fast/PhysicalShardMove.toml" looked like it could be related but the complaints were Reason="Consistency check: Shard size is more than 7.000000 std dev from estimate" ... an issue being worked on in another issue. Identifier for my simulation runs was 20230331-041546-stack-9d22ddb0aa037311 |
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
1 similar comment
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
|
Non-change to retriever CI build (failures seem to be related to missing docker deploy found by He Liu) |
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr on Linux CentOS 7
|
fdbserver/MoveKeys.actor.cpp
Outdated
|
|
||
| // Update dest servers for this range to be equal to servers | ||
| krmSetPreviouslyEmptyRange(&tr, | ||
| krmSetPreviouslyEmptyRange(&(tr.getTransaction()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: will &tr be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Was copied from startMoveKeys. Let me fix.
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
|
The "Result of foundationdb-pr-cluster-tests on Linux CentOS 7" failure is the test_ycsb k8s integration test failing to deploy the cluster in the Result of 'Result of foundationdb-pr on Linux CentOS 7' failure is below unit tests failing: Unit tests passed for another concurrent PR so perhaps related? Retrying. |
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|