Skip to content

Conversation

@alexmiller-apple
Copy link
Contributor

peekTracker was held on the Shared TLog (TLogData), whereas peeks are
received and replied to as part of a TLog instance (LogData). When a
peek was received on a TLog, it was registered into peekTracker along
with the ReplyPromise. If the TLog was then removed as part of a
no-longer-needed generation of TLogs, there is nothing left to reply to
the request, but by holding onto the ReplyPromise in peekTracker, we
leave the remote end with an expectation that we will reply. Then,
10min later, peekTrackerCleanup runs and finally times out the peek
cursor, thus preventing FDB from being completely stuck.

Now, each TLog generation has its own peekTracker, and when a TLog is
destroyed, it times out all of the pending peek curors that are still
expecting a response. This will then trigger the client to re-issue
them to the next generation of TLogs, thus removing the 10min gap to do
so.

…yed.

`peekTracker` was held on the Shared TLog (TLogData), whereas peeks are
received and replied to as part of a TLog instance (LogData).  When a
peek was received on a TLog, it was registered into peekTracker along
with the ReplyPromise.  If the TLog was then removed as part of a
no-longer-needed generation of TLogs, there is nothing left to reply to
the request, but by holding onto the ReplyPromise in peekTracker, we
leave the remote end with an expectation that we will reply.  Then,
10min later, peekTrackerCleanup runs and finally times out the peek
cursor, thus preventing FDB from being completely stuck.

Now, each TLog generation has its own `peekTracker`, and when a TLog is
destroyed, it times out all of the pending peek curors that are still
expecting a response.  This will then trigger the client to re-issue
them to the next generation of TLogs, thus removing the 10min gap to do
so.
@alexmiller-apple
Copy link
Contributor Author

test this please

1 similar comment
@alexmiller-apple
Copy link
Contributor Author

test this please

@alexmiller-apple
Copy link
Contributor Author

@fdb-build, test bindings please as a test of your configuration

1 similar comment
@alexmiller-apple
Copy link
Contributor Author

@fdb-build, test bindings please as a test of your configuration

And refactor some code to make adding more TLogVersions easier.
@alexmiller-apple
Copy link
Contributor Author

Performance test says that even with this code, the ~10min random break still exists. So the behavior here is still a bug, but apparently not the one I was chasing.

@alexmiller-apple
Copy link
Contributor Author

I logged all the times that we send a timed_out() to a peek cursor, and:

  1. There are indeed an overwhelming number of in-flight peeks that now get more promptly timed out when the TLog is destructed
  2. There's still a large number of cleanupPeekTrackers induced timed_out()s, that is still the cause of what I'm seeing and I don't have an explanation for.

I'll leave correctness churning on this overnight, but this PR does seem to be an improvement even though it's not a fix.

There were error cases that would cause a peek to terminate early or be
cancelled without sending anything to the next peek in line.  We would
thus end up with the first peek in a sequence waiting on its future, and
nothing that exists that would send to that future.
@alexmiller-apple
Copy link
Contributor Author

alexmiller-apple commented Jul 15, 2019

BeforeAfter
image image

There's four stages to this test. I accidentally ran them with different test durations, so to reference points from the before image's timing:

  1. Time=0 Both DCs are alive.
  2. Time=60 Only the primary DC is alive. (The orange datapoint hangs around, and is a lie.)
  3. Time=450 The remote DC brought back up.
  4. Time=600 The workload ends.

The LogIngestRate of 0 for the secondary during 450-600 was solved by #1795, which is in master, but I removed from this branch to rule it out as part of debugging.

The bug we're chasing here is that there is a gap between 1200 and 1800 where LogIngestRate for the secondary is 0. This gap has now been fixed. The minor dip is the cursors struggling to rapidly fast forward over a large number of versions, which they're not optimized to do well.

The buggify was actually incorrect and broke an invariant, which I then
fixed on the other side, but this work was actually unneeded in total.

The real issue being fixed was returnIfBlock not sending an error, as
well as the other error cases.
@alexmiller-apple
Copy link
Contributor Author

@etschannen, if there was something else I was supposed to do before this being merged, I've forgotten what it was.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants