Fix parallel peek stalling for 10min when a TLog generation is destroyed #1818

Conversation
`peekTracker` was held on the Shared TLog (TLogData), whereas peeks are received and replied to as part of a TLog instance (LogData). When a peek was received on a TLog, it was registered into peekTracker along with the ReplyPromise. If the TLog was then removed as part of a no-longer-needed generation of TLogs, there was nothing left to reply to the request, but by holding onto the ReplyPromise in peekTracker, we left the remote end expecting a reply. Only 10min later, when peekTrackerCleanup ran and finally timed out the peek cursor, was FDB prevented from being completely stuck. Now, each TLog generation has its own `peekTracker`, and when a TLog is destroyed, it times out all of the pending peek cursors that are still expecting a response. This triggers the client to re-issue them to the next generation of TLogs, removing the 10min gap.
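The per-generation tracker described above can be sketched roughly like this (an illustrative sketch, not the actual FoundationDB code; `PeekTracker`, `PeekStatus`, and the callback-based reply are stand-ins for the real TLogData/LogData structures and ReplyPromise):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Minimal sketch: each TLog generation owns its own peek tracker, so
// pending peeks can be failed the moment the generation is destroyed,
// instead of waiting for a periodic 10-minute cleanup pass.
enum class PeekStatus { Pending, Replied, TimedOut };

struct PeekTracker {
    // peek id -> reply callback, standing in for a held ReplyPromise
    std::map<int, std::function<void(PeekStatus)>> pending;

    void registerPeek(int id, std::function<void(PeekStatus)> reply) {
        pending[id] = std::move(reply);
    }

    void reply(int id) {
        auto it = pending.find(id);
        if (it != pending.end()) {
            it->second(PeekStatus::Replied);
            pending.erase(it);
        }
    }

    // On destruction of the owning generation, time out every peek
    // still pending; the client then re-issues it to the next
    // generation of TLogs right away.
    ~PeekTracker() {
        for (auto& [id, reply] : pending)
            reply(PeekStatus::TimedOut);
    }
};
```

Because the tracker's lifetime is tied to the generation, nothing extra has to remember to clean it up; destroying the LogData-equivalent object is itself the cleanup.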
test this please

@fdb-build, test bindings please as a test of your configuration
And refactor some code to make adding more TLogVersions easier.
Performance test says that even with this code, the ~10min random break still exists. So the behavior here is still a bug, but apparently not the one I was chasing.
I logged all the times that we send a
I'll leave correctness churning on this overnight, but this PR does seem to be an improvement even though it's not a fix.
There were error cases that would cause a peek to terminate early or be cancelled without sending anything to the next peek in line. We would thus end up with the first peek in a sequence waiting on its future, with nothing left in existence that would ever send to that future.
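The hazard, and the shape of the fix, can be sketched like this (illustrative only; `SequencedPeek` and `ReplySlot` are made-up names, and the slot stands in for flow's Promise/Future pair):

```cpp
#include <cassert>
#include <optional>
#include <string>

// A one-shot reply slot standing in for a Promise: peek N+1 waits on
// the slot that peek N is expected to fill. If peek N errors out or
// is cancelled without filling it, peek N+1 waits forever.
struct ReplySlot {
    std::optional<int> value;          // version delivered on success
    std::optional<std::string> error;  // error delivered on early exit

    bool completed() const { return value.has_value() || error.has_value(); }
};

struct SequencedPeek {
    ReplySlot* nextInLine;  // slot the next peek in sequence waits on

    void completeNormally(int version) { nextInLine->value = version; }

    // The fix: an early-terminated or cancelled peek still completes
    // the next peek's slot, with an error, so the waiter can proceed
    // or re-issue instead of blocking forever.
    void terminateEarly() { nextInLine->error = "peek terminated"; }
};
```

The invariant being restored is that every exit path of a peek, normal or not, completes the slot handed to the next peek in the sequence.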
There are four stages to this test. I accidentally ran them with different test durations, so reference points from the before image's timing are:
The LogIngestRate of 0 for the secondary during 450-600 was solved by #1795, which is in master but which I removed from this branch to rule it out as part of debugging. The bug we're chasing here is the gap between 1200 and 1800 where LogIngestRate for the secondary is 0. This gap has now been fixed. The minor dip is the cursors struggling to rapidly fast-forward over a large number of versions, which they're not optimized to do well.
The buggify was actually incorrect and broke an invariant, which I then fixed on the other side, but in the end that work was unneeded. The real issue being fixed was returnIfBlocked not sending an error, along with the other error cases.
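The returnIfBlocked problem has the same shape as the other error paths: a guard that rejects a request must still complete the reply. A hypothetical sketch (simplified, not the real flow signature):

```cpp
#include <cassert>

enum class Reply { None, Data, Error };

// Buggy shape: reports "blocked" to the caller but never completes
// the reply, so the requester on the other end is left hanging.
bool returnIfBlockedBuggy(bool blocked, Reply& /*reply*/) {
    return blocked;
}

// Fixed shape: a blocked request still gets an error reply, so the
// requester wakes up and can retry against the next generation.
bool returnIfBlockedFixed(bool blocked, Reply& reply) {
    if (blocked)
        reply = Reply::Error;
    return blocked;
}
```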
@etschannen, if there was something else I was supposed to do before this is merged, I've forgotten what it was.

