Fix parallel peek stalling for 10min when a TLog generation is destroyed #1818

alexmiller-apple · 2019-07-10T00:40:37Z

peekTracker was held on the Shared TLog (TLogData), whereas peeks are
received and replied to as part of a TLog instance (LogData). When a
peek was received on a TLog, it was registered into peekTracker along
with the ReplyPromise. If the TLog was then removed as part of a
no-longer-needed generation of TLogs, there is nothing left to reply to
the request, but by holding onto the ReplyPromise in peekTracker, we
leave the remote end with an expectation that we will reply. Then,
10min later, peekTrackerCleanup runs and finally times out the peek
cursor, thus preventing FDB from being completely stuck.

Now, each TLog generation has its own peekTracker, and when a TLog is
destroyed, it times out all of the pending peek cursors that are still
expecting a response. This will then trigger the client to re-issue
them to the next generation of TLogs, thus removing the 10min gap to do
so.
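
The ownership change described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the flow C++ from the PR: `ReplyPromise`, `LogGeneration`, `register_peek`, and `destroy` are hypothetical names chosen for the example, and `TimedOut` merely stands in for the `timed_out()` error. It assumes each generation keeps its own map of pending peeks and fails them all when it goes away.

```python
class TimedOut(Exception):
    """Stand-in for the timed_out() error (hypothetical type for this sketch)."""


class ReplyPromise:
    """Hypothetical one-shot reply slot that the remote end is waiting on."""

    def __init__(self):
        self.result = None
        self.error = None

    @property
    def pending(self):
        return self.result is None and self.error is None

    def send_error(self, err):
        # Only the first reply counts; later sends are ignored.
        if self.pending:
            self.error = err


class LogGeneration:
    """Models one TLog generation that owns its own peek tracker."""

    def __init__(self):
        self.peek_tracker = {}  # peek id -> ReplyPromise still awaiting a reply

    def register_peek(self, peek_id):
        promise = ReplyPromise()
        self.peek_tracker[peek_id] = promise
        return promise

    def destroy(self):
        # Fail every outstanding peek immediately instead of leaving it for a
        # periodic cleanup pass, so the peeker can retry on the next generation.
        for promise in self.peek_tracker.values():
            promise.send_error(TimedOut())
        self.peek_tracker.clear()


old_gen = LogGeneration()
reply = old_gen.register_peek("peek-1")
old_gen.destroy()
print(type(reply.error).__name__)  # TimedOut
```

If the tracker lived on a shared object that outlives the generation, `destroy()` would have nothing to iterate over and `reply` would stay pending until some later cleanup, which is the stall the description refers to.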

alexmiller-apple · 2019-07-10T00:47:43Z

test this please

alexmiller-apple · 2019-07-10T00:52:08Z

test this please

alexmiller-apple · 2019-07-10T00:59:24Z

@fdb-build, test bindings please as a test of your configuration

alexmiller-apple · 2019-07-10T01:01:32Z

@fdb-build, test bindings please as a test of your configuration

alexmiller-apple · 2019-07-10T03:18:16Z

Performance test says that even with this code, the ~10min random break still exists. So the behavior here is still a bug, but apparently not the one I was chasing.

alexmiller-apple · 2019-07-10T09:17:39Z

I logged all the times that we send a timed_out() to a peek cursor, and:

- There are indeed an overwhelming number of in-flight peeks that now get timed out more promptly when the TLog is destructed.
- There's still a large number of cleanupPeekTrackers-induced timed_out()s, which are still the cause of what I'm seeing and which I don't have an explanation for.

I'll leave correctness churning on this overnight, but this PR does seem to be an improvement even though it's not a fix.

There were error cases that would cause a peek to terminate early or be cancelled without sending anything to the next peek in line. We would thus end up with the first peek in a sequence waiting on its future, and nothing that exists that would send to that future.
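
The hand-off problem described above can be sketched with chained futures. This is a rough model in Python's `asyncio`, not the actual TLog code: the `peek` function, the `slots` list, and the use of `RuntimeError` are all assumptions made for illustration, and cancellation is not modelled. Each peek waits for its own slot and is responsible for completing the next slot, on the error path as well as the success path.

```python
import asyncio


async def peek(seq, slots, fail=False):
    """Handle peek number `seq`; slots[seq] is its turn, slots[seq + 1] the next."""
    try:
        start = await slots[seq]  # wait for the previous peek to hand over
        if fail:
            raise RuntimeError("peek %d failed" % seq)
        end = start + 10  # pretend this peek served 10 versions
    except Exception as err:
        # On every error path, still notify the next peek in line; otherwise
        # it waits on a future that nothing will ever complete.
        if not slots[seq + 1].done():
            slots[seq + 1].set_exception(err)
        raise
    slots[seq + 1].set_result(end)
    return end


async def demo():
    loop = asyncio.get_running_loop()
    slots = [loop.create_future() for _ in range(3)]
    slots[0].set_result(100)  # the first peek may start at version 100
    results = await asyncio.gather(
        peek(0, slots, fail=True), peek(1, slots), return_exceptions=True
    )
    slots[-1].exception()  # consume the tail error so asyncio does not warn
    return [type(r).__name__ for r in results]


outcome = asyncio.run(demo())
print(outcome)  # ['RuntimeError', 'RuntimeError']
```

Without the `except` branch, the second peek would never return, because its slot would never be completed by the failed first peek.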

alexmiller-apple · 2019-07-15T23:42:53Z

[Images: Before | After]

There are four stages to this test. I accidentally ran them with different test durations, so to reference points from the before image's timing:

- Time=0: Both DCs are alive.
- Time=60: Only the primary DC is alive. (The orange datapoint hangs around, and is a lie.)
- Time=450: The remote DC is brought back up.
- Time=600: The workload ends.

The LogIngestRate of 0 for the secondary during 450-600 was solved by #1795, which is in master, but which I removed from this branch to rule it out as part of debugging.

The bug we're chasing here is that there is a gap between 1200 and 1800 where LogIngestRate for the secondary is 0. This gap has now been fixed. The minor dip is the cursors struggling to rapidly fast forward over a large number of versions, which they're not optimized to do well.

alexmiller-apple · 2019-07-19T05:26:06Z

@etschannen, if there was something else I was supposed to do before this being merged, I've forgotten what it was.

alexmiller-apple requested a review from etschannen July 10, 2019 00:40

alexmiller-apple assigned etschannen Jul 10, 2019

alexmiller-apple added 2 commits July 9, 2019 18:20

Copy the same set of changes to OldTLogServer_6_0

2c7007d

Add a TLogVersion::V4

dfbf942

And refactor some code to make adding more TLogVersions easier.

alexmiller-apple added 3 commits July 15, 2019 16:43

Copy same changes to OldTLogServer_6_0.

32af112

Merge remote-tracking branch 'upstream/master' into peek-cursor-timeout-bug

4cc60dc

Remove buggify and unneeded safeguards.

812ce37

The buggify was actually incorrect and broke an invariant, which I then fixed on the other side, but this work was actually unneeded in total. The real issue being fixed was returnIfBlock not sending an error, as well as the other error cases.

etschannen merged commit 6d694cc into apple:master Jul 19, 2019

alexmiller-apple mentioned this pull request Oct 15, 2019

Fix the 10min multi-region recovery stall again #2242

Merged
