-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix a race condition between batched peek and pop #8660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a race condition between batched peek and pop #8660
Conversation
…emoval pop may be lost
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
fdbserver/TLogServer.actor.cpp
Outdated
|
|
||
| state Version endVersion; | ||
| state bool onlySpilled; | ||
| ASSERT_WE_THINK(reqBegin >= poppedVersion(logData, reqTag)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to use ASSERT instead? ASSERT_WE_THINK is noop in release build. Or, if this conditions triggers, return an error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is that we compute latest popped version using poppedVersion() instead of reading from (cached) poppedVer. So I want to avoid this computation in prod.
Also, I feel this should be an invarant rather than an error, so it's better to be caught in simulation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is still going to be computed in non-simulation, it just won't trigger the assert. If you want to prevent it from running at all, you would need to do something like:
if (g_network->isSimulated()) { ASSERT(...); }
Also, I think this condition would warrant failing the tlog since we know it shouldn't be happening and if it does it is dangerous, but if it's too expensive to compute then we don't have to have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poppedVersion() only does a map lookup, so it should be cheap. I am fine to keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous block handles the case for poppedVer > reqBegin, so it seems this assertion always true after all, when ENABLE_VERSION_VECTOR is not enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's also true when ENABLE_VERSION_VECTOR is enabled since we look up and set poppedVer at the end of that block. I'm in favor of just changing this to the ASSERT then if this is inexpensive, otherwise you can change it to the formulation I described above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All, good suggestion. I change it to ASSERT.
Doxense CI Report for Windows 10
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS BigSur 11.5.2
|
Result of foundationdb-pr-macos on macOS Monterey 12.x
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
cc28310 to
1866c57
Compare
Doxense CI Report for Windows 10
|
Result of foundationdb-pr-macos-m1 on macOS BigSur 11.5.2
|
Result of foundationdb-pr-macos on macOS Monterey 12.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
…version must be greater than latest pop version
9aa56ef to
2b7ed1d
Compare
Result of foundationdb-pr-clang on Linux CentOS 7
|
Doxense CI Report for Windows 10
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
As @sfc-gh-ajbeamon found out, batching empty peek may lost pop generated by storage server removal, so a removed storage server may still live after the storage server removal.
This is not an issue when
PEEK_BATCHING_EMPTY_MSGis turned off.Also add some assertions to catch the invariant that when replying TLogPeek, the begin version must be greater than the popped version.
500K simulation test: 20221102-060846-zhewu_7.1_fix-604d5aec66036f65 compressed=True data_size=31702403 duration=26141298 ended=500000 fail_fast=10 max_runs=500000 pass=500000 priority=100 remaining=0 runtime=2:52:57 sanity=False started=501434 stopped=20221102-090143 submitted=20221102-060846 timeout=5400 username=zhewu_7.1_fix
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branchormainif this is the youngest branch)