NickCraver commented Mar 21, 2023

Changes

This combination PR both fixes a GC issue (see below, and #2413 for details) and improves timeout exceptions. If a timeout happens for a message that was sitting in the backlog and was never sent, the user now gets a much more informative message like this:

Exception: The message timed out in the backlog attempting to send because no connection became available - Last Connection Exception: InternalFailure on 127.0.0.1:6379/Interactive, Initializing/NotStarted, last: GET, origin: ConnectedAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 500s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.6.99.22667, command=PING, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 1, serverEndpoint: 127.0.0.1:6379, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: NAMISTOU-3(SE.Redis-v2.6.99.22667), IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=2,Free=32765,Min=32,Max=32767), POOL: (Threads=18,QueuedItems=0,CompletedItems=65), v: 2.6.99.22667 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Today this isn't intuitive, especially for connections with AbortOnConnectFail set to false: the multiplexer never connects successfully, but the user just gets generic timeouts. This change makes the error more specific and includes the underlying exception (also as .InnerException), so the user can see the config/auth/whatever error that's preventing anything from ever being sent.
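For illustration, a minimal sketch (not from this PR's tests; the endpoint and the Log helper are assumed, as in the examples further down) of how a caller could surface the new detail - catch the timeout and inspect .InnerException for the underlying connection failure:

public async Task ObserveBacklogTimeoutExample()
{
    // abortConnect=false means the multiplexer keeps retrying in the background,
    // so commands sit in the backlog while no connection is available.
    using var conn = await ConnectionMultiplexer.ConnectAsync("127.0.0.1:6379,abortConnect=false");
    var db = conn.GetDatabase();

    try
    {
        await db.PingAsync();
    }
    catch (RedisTimeoutException ex)
    {
        // With this change, a message that timed out in the backlog explains why no
        // connection became available and carries the last connection exception here:
        Log("Timeout: " + ex.Message);
        Log("Underlying cause: " + ex.InnerException);
    }
}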

Also adds aoc: (0|1) to the exception message for easier advice in issues (reflecting what AbortOnConnectFail is set to).

Performance

Before:

| Method | Job | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Gen 1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|------:|------:|----------:|
| StringSet/s | Job-LYRTQY | .NET 5.0 | 620.2 us | 12.39 us | 10.34 us | 1,612.3 | - | - | 152 B |
| StringGet/s | Job-LYRTQY | .NET 5.0 | 580.4 us | 11.02 us | 9.77 us | 1,723.0 | - | - | 152 B |
| 'HashGetAll F+F/s' | Job-LYRTQY | .NET 5.0 | 1,304.5 us | 16.49 us | 14.62 us | 766.6 | 7.5000 | 2.5000 | 128,290 B |
| 'HashGetAll F+F/a' | Job-LYRTQY | .NET 5.0 | 1,322.6 us | 25.43 us | 26.12 us | 756.1 | 7.5000 | 2.5000 | 128,477 B |
| StringSet/s | Job-QXGLCI | .NET Framework 4.7.2 | 632.3 us | 12.41 us | 16.98 us | 1,581.6 | - | - | 404 B |
| StringGet/s | Job-QXGLCI | .NET Framework 4.7.2 | 598.5 us | 11.93 us | 20.25 us | 1,670.9 | - | - | 393 B |
| 'HashGetAll F+F/s' | Job-QXGLCI | .NET Framework 4.7.2 | 1,678.3 us | 30.45 us | 42.69 us | 595.9 | 20.0000 | 2.8571 | 132,193 B |
| 'HashGetAll F+F/a' | Job-QXGLCI | .NET Framework 4.7.2 | 1,680.9 us | 31.20 us | 27.66 us | 594.9 | 20.0000 | 2.8571 | 132,883 B |

After:

| Method | Job | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Gen 1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|------:|------:|----------:|
| StringSet/s | Job-ZNQCTW | .NET 5.0 | 601.8 us | 11.65 us | 16.71 us | 1,661.6 | - | - | 152 B |
| StringGet/s | Job-ZNQCTW | .NET 5.0 | 578.7 us | 11.45 us | 15.28 us | 1,728.1 | - | - | 152 B |
| 'HashGetAll F+F/s' | Job-ZNQCTW | .NET 5.0 | 1,332.4 us | 26.35 us | 36.07 us | 750.5 | 6.6667 | - | 128,289 B |
| 'HashGetAll F+F/a' | Job-ZNQCTW | .NET 5.0 | 1,307.6 us | 25.34 us | 36.34 us | 764.7 | 7.5000 | - | 128,477 B |
| StringSet/s | Job-SMOAIE | .NET Framework 4.7.2 | 642.1 us | 12.68 us | 27.03 us | 1,557.5 | - | - | 379 B |
| StringGet/s | Job-SMOAIE | .NET Framework 4.7.2 | 606.3 us | 11.92 us | 25.14 us | 1,649.3 | - | - | 358 B |
| 'HashGetAll F+F/s' | Job-SMOAIE | .NET Framework 4.7.2 | 1,655.0 us | 18.94 us | 20.27 us | 604.2 | 20.0000 | 3.3333 | 132,346 B |
| 'HashGetAll F+F/a' | Job-SMOAIE | .NET Framework 4.7.2 | 1,680.6 us | 17.46 us | 16.33 us | 595.0 | 20.0000 | 3.3333 | 132,977 B |

Original hanging issue for anyone curious - fixed in #2413:

Hanging problems

The new tests in this PR currently exhibit an indefinite hang, because the connection infrastructure is torn down before an await completes.

To repro indefinite hangs, please comment out this code in PhysicalBridge:

- AbandonPendingBacklog(new ObjectDisposedException("Connection is being finalized"));
+ //AbandonPendingBacklog(new ObjectDisposedException("Connection is being finalized"));

(note: I didn't leave the hang repro intact because it'll hose the build servers)

Take for example an even simpler version of DisconnectAndNoReconnectThrowsConnectionExceptionAsync here:

public async Task HangExample()
{
    using var conn = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
    var db = conn.GetDatabase();

    conn.AllowConnect = false; // Disconnect and don't allow re-connection
    var server = conn.GetServerSnapshot()[0];
    server.SimulateConnectionFailure(SimulatedFailureType.All);

    // This will never complete, because nothing clears the backlog and activates the async continuations. 
    // Rather than the PhysicalBridge clearing the backlog, everything gets finalized before this test method completes:
    // The ConnectionMultiplexer, ServerEndpoint, and PhysicalBridge are all torn down (we've added finalizers to each to confirm)
    await db.PingAsync();
}

If we restore the AbandonPendingBacklog call above, as another experiment we can demonstrate the case where Task.AsyncState holds the reference:

public async Task HangExample2()
{
    using var conn = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
    var db = conn.GetDatabase(asyncState: conn);

    conn.AllowConnect = false; // Disconnect and don't allow re-connection
    var server = conn.GetServerSnapshot()[0];
    server.SimulateConnectionFailure(SimulatedFailureType.All);

    object? state = null;
    try
    {
        var task = db.PingAsync();
        state = task.AsyncState;
        await task;
    }
    catch (Exception ex)
    {
        Log("Task.AsyncState (" + state?.GetType() + "): " + state?.ToString());
        Log("Exception: " + ex);
    }
}

Logs:

17:15:21.1585: Task.AsyncState (StackExchange.Redis.ConnectionMultiplexer): NAMISTOU-3(SE.Redis-v2.6.101.9028)
17:15:21.1585: Exception: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'Connection is being finalized'.
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at StackExchange.Redis.Tests.AbortOnConnectFailTests.<HangExample2>d__4.MoveNext() in C:\git\StackExchange\StackExchange.Redis\tests\StackExchange.Redis.Tests\AbortOnConnectFailTests.cs:line 77

For ref, the object hierarchy is:

  • ConnectionMultiplexer
    • ServerEndPoint
      • PhysicalBridge (contains the unsent backlog queue)
        • PhysicalConnection
          • Socket

In the above, we've tried:

  1. Adding a finalizer to ConnectionMultiplexer, ServerEndPoint, and PhysicalBridge (see the finalizer sketch after the notes below) - all are getting finalized before PingAsync() completes.
  2. Adding GC.KeepAlive(...) calls for each of the above
  3. Passing in conn to .GetDatabase() - this passes it as AsyncState all the way down through TaskCompletionSource and correctly sets it on the Task in PingAsync

Notes:

  • In 3 above, we can simultaneously see that Task.AsyncState is indeed the ConnectionMultiplexer, so we have a reference to it...and it's still being finalized. It seems like the async state machine isn't being correctly held onto in some way.
  • This doesn't reproduce as an isolated test - it needs the whole suite running, presumably to create GC pressure.
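
For illustration, here's a minimal sketch of the diagnostic approach in items 1 and 2 above - the type and method names are stand-ins, and the real finalizers were temporary, added only to confirm teardown order:

internal sealed class FinalizationProbe
{
    private readonly string _name;
    public FinalizationProbe(string name) => _name = name;

    // Diagnostic-only finalizer: logs when the GC tears this object down. This is how we
    // confirmed the ConnectionMultiplexer/ServerEndPoint/PhysicalBridge chain was being
    // finalized before the awaited PingAsync() ever completed.
    ~FinalizationProbe() => Console.WriteLine(_name + " finalized");
}

public void FinalizationOrderExample()
{
    var probe = new FinalizationProbe("ConnectionMultiplexer (stand-in)");

    // Force a collection to see whether the probe is still reachable at this point.
    GC.Collect();
    GC.WaitForPendingFinalizers();

    // GC.KeepAlive extends the probe's reachable lifetime to here; without it, the JIT may
    // treat the local as dead after its last use, so a collection can finalize it early.
    GC.KeepAlive(probe);
}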

Aside notes:

Note: this also adds aoc: 1/0 to timeout messages - this'll make it easier for us to see whether a user has abortConnect=false set when advising them in issues.

This is all to better inform users when timeouts are happening in the backlog buffer. However, it's also exposed a race in the test suite: alone or in a small batch the tests run fine, but when we run everything together, `DisconnectAndNoReconnectThrowsConnectionExceptionAsync` hangs indefinitely. And of course it doesn't reproduce under the debugger, so we have some sort of fun race here.

Note the new handling inside ProcessBacklogAsync, which came out of trying to figure this out - we have one test that's in a disposal race with backlogged items, and we need to make sure everything in the backlog gets punted out so nothing awaits forever. But there's still another cause of this somewhere.
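As a rough illustration of that "punt everything out" pattern (a minimal sketch with assumed names, not the actual PhysicalBridge/ProcessBacklogAsync code): any item still sitting in the backlog gets its completion source failed, so awaiters are released instead of hanging forever.

private readonly ConcurrentQueue<TaskCompletionSource<bool>> _backlog = new();

public Task EnqueueToBacklogExample()
{
    // RunContinuationsAsynchronously keeps awaiter continuations off the thread
    // that later fails the backlog.
    var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    _backlog.Enqueue(tcs);
    return tcs.Task;
}

public void AbandonBacklogExample(Exception reason)
{
    // Fail every queued item so no caller is left awaiting a task that nothing will ever
    // complete - the indefinite hang in the tests above is exactly this scenario.
    while (_backlog.TryDequeue(out var tcs))
    {
        tcs.TrySetException(reason);
    }
}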
NickCraver changed the title from "Timeouts: Improve Backlog handling and errors for users" to "Timeouts: Improve Backlog handling and errors for users + GC rooting fixes for outstanding scenarios" on Mar 29, 2023
NickCraver marked this pull request as ready for review March 30, 2023 16:36
NickCraver merged commit 129d59f into main on Mar 30, 2023
NickCraver deleted the craver/timeout-error-improvements branch March 30, 2023 18:03

Development

Successfully merging this pull request may close these issues.

Deadlock while awaiting Task from Database using disposed/closed ConnectionMultiplexer
