NickCraver commented Mar 21, 2023

Changes

This combination PR both fixes a GC issue (see below, and #2413 for details) and improves timeout exceptions. If a timeout happens for a message that was sitting in the backlog and was never sent, the user now gets a much more informative message like this:

Exception: The message timed out in the backlog attempting to send because no connection became available - Last Connection Exception: InternalFailure on 127.0.0.1:6379/Interactive, Initializing/NotStarted, last: GET, origin: ConnectedAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 500s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.6.99.22667, command=PING, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 1, serverEndpoint: 127.0.0.1:6379, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: NAMISTOU-3(SE.Redis-v2.6.99.22667), IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=2,Free=32765,Min=32,Max=32767), POOL: (Threads=18,QueuedItems=0,CompletedItems=65), v: 2.6.99.22667 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Today this isn't intuitive, especially for connections with AbortOnConnectFail set to false: the multiplexer never connects successfully, but the user just gets generic timeouts. This change makes the error more specific and includes the underlying exception (also as .InnerException), so the user can see the config/auth/whatever error that's preventing anything from ever being sent.
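For illustration, a minimal sketch (not from this PR's tests; the endpoint and the Log helper are assumed, as in the examples further down) of how a caller could surface the new detail - catch the timeout and inspect .InnerException for the underlying connection failure:

public async Task ObserveBacklogTimeoutExample()
{
    // abortConnect=false means the multiplexer keeps retrying in the background,
    // so commands sit in the backlog while no connection is available.
    using var conn = await ConnectionMultiplexer.ConnectAsync("127.0.0.1:6379,abortConnect=false");
    var db = conn.GetDatabase();

    try
    {
        await db.PingAsync();
    }
    catch (RedisTimeoutException ex)
    {
        // With this change, a message that timed out in the backlog explains why no
        // connection became available and carries the last connection exception here:
        Log("Timeout: " + ex.Message);
        Log("Underlying cause: " + ex.InnerException);
    }
}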

Also adds aoc: (0|1) to the exception message for easier advice in issues (reflecting what AbortOnConnectFail is set to).

Performance

Before:

| Method | Job | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Gen 1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|------:|------:|----------:|
| StringSet/s | Job-LYRTQY | .NET 5.0 | 620.2 us | 12.39 us | 10.34 us | 1,612.3 | - | - | 152 B |
| StringGet/s | Job-LYRTQY | .NET 5.0 | 580.4 us | 11.02 us | 9.77 us | 1,723.0 | - | - | 152 B |
| 'HashGetAll F+F/s' | Job-LYRTQY | .NET 5.0 | 1,304.5 us | 16.49 us | 14.62 us | 766.6 | 7.5000 | 2.5000 | 128,290 B |
| 'HashGetAll F+F/a' | Job-LYRTQY | .NET 5.0 | 1,322.6 us | 25.43 us | 26.12 us | 756.1 | 7.5000 | 2.5000 | 128,477 B |
| StringSet/s | Job-QXGLCI | .NET Framework 4.7.2 | 632.3 us | 12.41 us | 16.98 us | 1,581.6 | - | - | 404 B |
| StringGet/s | Job-QXGLCI | .NET Framework 4.7.2 | 598.5 us | 11.93 us | 20.25 us | 1,670.9 | - | - | 393 B |
| 'HashGetAll F+F/s' | Job-QXGLCI | .NET Framework 4.7.2 | 1,678.3 us | 30.45 us | 42.69 us | 595.9 | 20.0000 | 2.8571 | 132,193 B |
| 'HashGetAll F+F/a' | Job-QXGLCI | .NET Framework 4.7.2 | 1,680.9 us | 31.20 us | 27.66 us | 594.9 | 20.0000 | 2.8571 | 132,883 B |

After:

| Method | Job | Runtime | Mean | Error | StdDev | Op/s | Gen 0 | Gen 1 | Allocated |
|--------|-----|---------|-----:|------:|-------:|-----:|------:|------:|----------:|
| StringSet/s | Job-ZNQCTW | .NET 5.0 | 601.8 us | 11.65 us | 16.71 us | 1,661.6 | - | - | 152 B |
| StringGet/s | Job-ZNQCTW | .NET 5.0 | 578.7 us | 11.45 us | 15.28 us | 1,728.1 | - | - | 152 B |
| 'HashGetAll F+F/s' | Job-ZNQCTW | .NET 5.0 | 1,332.4 us | 26.35 us | 36.07 us | 750.5 | 6.6667 | - | 128,289 B |
| 'HashGetAll F+F/a' | Job-ZNQCTW | .NET 5.0 | 1,307.6 us | 25.34 us | 36.34 us | 764.7 | 7.5000 | - | 128,477 B |
| StringSet/s | Job-SMOAIE | .NET Framework 4.7.2 | 642.1 us | 12.68 us | 27.03 us | 1,557.5 | - | - | 379 B |
| StringGet/s | Job-SMOAIE | .NET Framework 4.7.2 | 606.3 us | 11.92 us | 25.14 us | 1,649.3 | - | - | 358 B |
| 'HashGetAll F+F/s' | Job-SMOAIE | .NET Framework 4.7.2 | 1,655.0 us | 18.94 us | 20.27 us | 604.2 | 20.0000 | 3.3333 | 132,346 B |
| 'HashGetAll F+F/a' | Job-SMOAIE | .NET Framework 4.7.2 | 1,680.6 us | 17.46 us | 16.33 us | 595.0 | 20.0000 | 3.3333 | 132,977 B |

Original hanging issue for anyone curious - fixed in #2413:

Hanging problems

The new tests in this PR currently exhibit an indefinite hang, because the connection infrastructure is torn down before an await completes.

To repro indefinite hangs, please comment out this code in PhysicalBridge:

- AbandonPendingBacklog(new ObjectDisposedException("Connection is being finalized"));
+ //AbandonPendingBacklog(new ObjectDisposedException("Connection is being finalized"));

(note: I didn't leave the hang repro intact because it'll hose the build servers)

Take for example an even simpler version of DisconnectAndNoReconnectThrowsConnectionExceptionAsync here:

public async Task HangExample()
{
    using var conn = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
    var db = conn.GetDatabase();

    conn.AllowConnect = false; // Disconnect and don't allow re-connection
    var server = conn.GetServerSnapshot()[0];
    server.SimulateConnectionFailure(SimulatedFailureType.All);

    // This will never complete, because nothing clears the backlog and activates the async continuations. 
    // Rather than the PhysicalBridge clearing the backlog, everything gets finalized before this test method completes:
    // The ConnectionMultiplexer, ServerEndpoint, and PhysicalBridge are all torn down (we've added finalizers to each to confirm)
    await db.PingAsync();
}

If we restore the AbandonPendingBacklog call above, as another experiment we can demonstrate the case where Task.AsyncState holds the reference:

public async Task HangExample2()
{
    using var conn = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
    var db = conn.GetDatabase(asyncState: conn);

    conn.AllowConnect = false; // Disconnect and don't allow re-connection
    var server = conn.GetServerSnapshot()[0];
    server.SimulateConnectionFailure(SimulatedFailureType.All);

    object? state = null;
    try
    {
        var task = db.PingAsync();
        state = task.AsyncState;
        await task;
    }
    catch (Exception ex)
    {
        Log("Task.AsyncState (" + state?.GetType() + "): " + state?.ToString());
        Log("Exception: " + ex);
    }
}

Logs:

17:15:21.1585: Task.AsyncState (StackExchange.Redis.ConnectionMultiplexer): NAMISTOU-3(SE.Redis-v2.6.101.9028)
17:15:21.1585: Exception: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'Connection is being finalized'.
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at StackExchange.Redis.Tests.AbortOnConnectFailTests.<HangExample2>d__4.MoveNext() in C:\git\StackExchange\StackExchange.Redis\tests\StackExchange.Redis.Tests\AbortOnConnectFailTests.cs:line 77

For ref, the object hierarchy is:

  • ConnectionMultiplexer
    • ServerEndPoint
      • PhysicalBridge (contains the unsent backlog queue)
        • PhysicalConnection
          • Socket

In the above, we've tried:

  1. Adding a finalizer to ConnectionMultiplexer, ServerEndPoint, and PhysicalBridge (see the finalizer sketch after the notes below) - all are getting finalized before PingAsync() completes.
  2. Adding GC.KeepAlive(...) calls for each of the above
  3. Passing in conn to .GetDatabase() - this passes it as AsyncState all the way down through TaskCompletionSource and correctly sets it on the Task in PingAsync

Notes:

  • In 3 above, we can simultaneously see that Task.AsyncState is indeed the ConnectionMultiplexer, so we have a reference to it...and it's still being finalized. It seems like the async state machine isn't being correctly held onto in some way.
  • This doesn't reproduce as an isolated test - it needs the whole suite running, presumably to create GC pressure.
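
For illustration, here's a minimal sketch of the diagnostic approach in items 1 and 2 above - the type and method names are stand-ins, and the real finalizers were temporary, added only to confirm teardown order:

internal sealed class FinalizationProbe
{
    private readonly string _name;
    public FinalizationProbe(string name) => _name = name;

    // Diagnostic-only finalizer: logs when the GC tears this object down. This is how we
    // confirmed the ConnectionMultiplexer/ServerEndPoint/PhysicalBridge chain was being
    // finalized before the awaited PingAsync() ever completed.
    ~FinalizationProbe() => Console.WriteLine(_name + " finalized");
}

public void FinalizationOrderExample()
{
    var probe = new FinalizationProbe("ConnectionMultiplexer (stand-in)");

    // Force a collection to see whether the probe is still reachable at this point.
    GC.Collect();
    GC.WaitForPendingFinalizers();

    // GC.KeepAlive extends the probe's reachable lifetime to here; without it, the JIT may
    // treat the local as dead after its last use, so a collection can finalize it early.
    GC.KeepAlive(probe);
}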

Aside notes:

Note: this also adds aoc: 1/0 to timeout messages - this'll make it easier for us to see whether a user has abortConnect=false set when advising them in issues.

This is all to better inform users when timeouts are happening in the backlog buffer. However, it's also exposed a race in the test suite: alone or in a small batch the tests run fine, but when we run everything together, `DisconnectAndNoReconnectThrowsConnectionExceptionAsync` hangs indefinitely. And of course it doesn't reproduce under the debugger, so we have some sort of fun race here.

Note the new handling inside ProcessBacklogAsync, which came out of trying to figure this out - we have one test that's in a disposal race with backlogged items, and we need to make sure everything in the backlog gets punted out so nothing awaits forever. But there's still another cause of this somewhere.
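As a rough illustration of that "punt everything out" pattern (a minimal sketch with assumed names, not the actual PhysicalBridge/ProcessBacklogAsync code): any item still sitting in the backlog gets its completion source failed, so awaiters are released instead of hanging forever.

private readonly ConcurrentQueue<TaskCompletionSource<bool>> _backlog = new();

public Task EnqueueToBacklogExample()
{
    // RunContinuationsAsynchronously keeps awaiter continuations off the thread
    // that later fails the backlog.
    var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    _backlog.Enqueue(tcs);
    return tcs.Task;
}

public void AbandonBacklogExample(Exception reason)
{
    // Fail every queued item so no caller is left awaiting a task that nothing will ever
    // complete - the indefinite hang in the tests above is exactly this scenario.
    while (_backlog.TryDequeue(out var tcs))
    {
        tcs.TrySetException(reason);
    }
}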
NickCraver changed the title from "Timeouts: Improve Backlog handling and errors for users" to "Timeouts: Improve Backlog handling and errors for users + GC rooting fixes for outstanding scenarios" on Mar 29, 2023
NickCraver marked this pull request as ready for review March 30, 2023 16:36
NickCraver merged commit 129d59f into main on Mar 30, 2023
NickCraver deleted the craver/timeout-error-improvements branch March 30, 2023 18:03

Development

Successfully merging this pull request may close these issues.

Deadlock while awaiting Task from Database using disposed/closed ConnectionMultiplexer
