Timeouts: Improve Backlog handling and errors for users + GC rooting fixes for outstanding scenarios #2408
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Changes
This combination PR is both fixing a GC issue (see below, and #2413 for details) and improves timeout exception. Basically if a timeout happens for a message that was in the backlog but was never sent, the user now gets a much more informative message like this:
Today, this isn't intuitive especially for connections with
AbortOnConnectFailset tofalse. What happens is a multiplexer never connects successfully, but the user just gets generic timeouts. This makes the error more specific and includes the inner exception (also as.InnerException) for more details, informing the user of a config/auth/whatever error underneath as to why things are never successfully sending.Also adds
aoc: (0|1)to the exception message for easier advice in issues (reflecting whatAbortOnConnectFailis set to).Performance
Before:
After:
Original hanging issue for anyone curious - fixed in #2413:
Hanging problems
The new tests in this PR currently exhibit an indefinite hang from teardown of infrastructure before an await completes.
To repro indefinite hangs, please comment out this code in
PhysicalBridge:(note: I didn't leave the hang repro intact because it'll hose the build servers)
Take for example an even simpler version of
DisconnectAndNoReconnectThrowsConnectionExceptionAsynchere:If we restore the
AbandonPendingBacklogabove as another experiment, we can demo theTask.AsyncStatereferenced case:Logs:
For ref, the object hierarchy is:
In the above, we've tried:
ConnectionMultiplexer,ServerEndPoint, andPhysicalBrdige- all are getting finalized beforePingAsync()completes.GC.KeepAlive(...)to each of the aboveconnto.GetDatabase()- this passes it as AsyncState all the way down through TaskCompletionSource and correctly sets it on the Task inPingAsyncNotes:
Task.AsyncStateis indeed theConnectionMultiplexer, so we have a reference to it...and it's still being finalized. It seems like the async state machine isn't being correctly held onto in some way.Aside notes:
Note also adds
aoc: 1/0to timeout messages - this'll make it easier for us to see if a user is setting abortConnect=false or not when advising them.