
Conversation

@deepakverma (Contributor) commented Jun 7, 2021

PR to auto-retry requests on a connection failure:

Requirement

With the current implementation, during a network blip the StackExchange.Redis multiplexer marks any in-flight or not-yet-sent requests as failed.
This feature adds a built-in retry mechanism to StackExchange.Redis: based on a retry manager, the library will queue such requests and play them back once the connection has been restored.
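
For illustration only, a minimal sketch of the queue-and-replay idea (all names here - RetryQueueManager, Enqueue, DrainTo - are hypothetical and not this PR's actual API):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch only: a retry manager that parks failed commands during a
// network blip and replays them once the connection is restored.
public sealed class RetryQueueManager
{
    private readonly Queue<string> _queue = new Queue<string>();

    // Called when a connection failure is observed for an in-flight or
    // not-yet-sent request: park the command instead of failing it.
    public void Enqueue(string command)
    {
        lock (_queue) _queue.Enqueue(command);
    }

    // Called once the multiplexer reports the connection has been restored:
    // replay everything that was parked during the blip.
    public void DrainTo(Action<string> send)
    {
        lock (_queue)
        {
            while (_queue.Count != 0) send(_queue.Dequeue());
        }
    }
}

public static class RetryDemo
{
    public static void Main()
    {
        var manager = new RetryQueueManager();
        manager.Enqueue("GET key1");   // would previously have failed immediately
        manager.Enqueue("SET key2 v"); // queued while disconnected
        // ...connection restored...
        manager.DrainTo(cmd => Console.WriteLine($"replaying: {cmd}"));
    }
}
```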

@deepakverma deepakverma changed the title WIP: Auto retry message on connection failure Auto retry message on connection failure Jun 21, 2021
@deepakverma deepakverma marked this pull request as ready for review June 21, 2021 17:07
@NickCraver (Collaborator)

@deepakverma if you merge main in here, it should help with test stability - can you do that when time allows, please?

# Conflicts:
#	src/StackExchange.Redis/PhysicalBridge.cs
@deepakverma (Contributor, Author)

> @deepakverma if you merge main in here, it should help with test stability - can you do that when time allows, please?

Great, I have merged, and I see that at least one of the flaky tests I was getting failures in is green now :)

@deepakverma deepakverma reopened this Jun 21, 2021
@NickCraver (Collaborator)

Will try and dig in tonight or tomorrow, but I do see SSL.Issue883_Exhaustive is for some reason failing across all environments - something's up there.

@deepakverma (Contributor, Author)

> Will try and dig in tonight or tomorrow, but I do see SSL.Issue883_Exhaustive is for some reason failing across all environments - something's up there.

Thanks, the test is fixed now.

@NickCraver (Collaborator) left a comment

Added initial thoughts for discussion - I'm unsure about the queues, but wanted to get a review in so we can chat with Marc on Monday.

@NickCraver (Collaborator)

Had some more thoughts after sleeping on it: there are inherent problems with this being global - for example, we probably don't want to re-issue REPLICAOF commands and things like that to the server, or SHUTDOWN for that matter. Which things should be retried needs to be determined, and either flagged or handled with some other approach.

There's one other point of confusion I thought of: we do retry a command when we get a MOVED response on cluster already today - let's just be cognizant in the final story that CommandFlags.NoRedirect remains clearly different.

@deepakverma (Contributor, Author)

> Had some more thoughts after sleeping on it: there are inherent problems with this being global - for example, we probably don't want to re-issue REPLICAOF commands and things like that to the server, or SHUTDOWN for that matter. Which things should be retried needs to be determined, and either flagged or handled with some other approach.
>
> There's one other point of confusion I thought of: we do retry a command when we get a MOVED response on cluster already today - let's just be cognizant in the final story that CommandFlags.NoRedirect remains clearly different.

Good point - SHUTDOWN, REPLICAOF, or any other similar command shouldn't be retried (cloud providers generally disable these commands anyway).
For the NoRedirect flag, I have removed the reset of that flag in the message retry code so that it is honored during retry.
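
For illustration, a minimal sketch of that kind of deny-list (the names below are placeholders; in the PR the check would hang off the internal Message/command types rather than raw strings):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: commands that must never be replayed automatically.
public static class RetryEligibility
{
    private static readonly HashSet<string> NeverRetry = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
    {
        // Administrative / topology-changing commands should not be re-issued
        // on reconnect (cloud providers generally disable them anyway).
        "SHUTDOWN", "REPLICAOF", "SLAVEOF",
    };

    public static bool ShouldRetry(string command) => !NeverRetry.Contains(command);
}
```

The NoRedirect point needs no extra code: the fix described above is simply that the flag is no longer cleared before a retry.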

@deepakverma deepakverma linked an issue Jul 29, 2021 that may be closed by this pull request
@deepakverma deepakverma requested a review from NickCraver July 29, 2021 05:25
@deepakverma (Contributor, Author)

Ah, I will take a look at how to resolve the conflicts here tomorrow morning.

….Redis into StackExchange-main

# Conflicts:
#	src/StackExchange.Redis/ConnectionMultiplexer.cs
#	src/StackExchange.Redis/Message.cs
@deepakverma (Contributor, Author)

I have resolved the conflicts. The CommandFlags unit test is not liking the addition of the new enum values; it fails with a duplicate enum value for the Prefer* flags. @mgravell, do you have any recommendations on how to fix it? Thanks.

internal bool PushMessageForRetry(Message message)
{
    bool wasEmpty;
    lock (queue)
Contributor

This code seems very similar to the old PhysicalBridge backlog code, though it has changed; I wonder if we should extract a MessageBacklog class to handle both functions? Reducing duplication of bugs is usually a win, and it might even turn out to be good for contention to do it the lock-free way.
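
A rough sketch of what that shared type could look like (the shape is assumed, not taken from the PR; TMessage stands in for the internal Message type, and ConcurrentQueue gives the lock-free variant):

```csharp
using System.Collections.Concurrent;

// Hypothetical sketch of a shared MessageBacklog used by both PhysicalBridge's
// backlog and the retry queue.
public sealed class MessageBacklog<TMessage>
{
    private readonly ConcurrentQueue<TMessage> _queue = new ConcurrentQueue<TMessage>();

    public bool IsEmpty => _queue.IsEmpty;

    // Mirrors the existing wasEmpty pattern: callers use the return value to
    // decide whether to kick off a processor. Note the check is only
    // approximate under concurrency.
    public bool Push(TMessage message)
    {
        bool wasEmpty = _queue.IsEmpty;
        _queue.Enqueue(message);
        return wasEmpty;
    }

    public bool TryDequeue(out TMessage message) => _queue.TryDequeue(out message);
}
```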

/// <returns></returns>
public ICommandRetryPolicy AlwaysRetryOnConnectionException()
{
    return new AlwaysRetryOnConnectionException();
Contributor

Nit, don't really need to new each time, could just have a static instance.
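
Sketched, with placeholder shapes for the interface and class (only the static-instance idea is the point):

```csharp
// Hypothetical sketch: the policy is stateless, so one cached instance can be
// handed out instead of allocating a new object on every call.
public interface ICommandRetryPolicy { }

public sealed class AlwaysRetryOnConnectionException : ICommandRetryPolicy
{
    public static AlwaysRetryOnConnectionException Instance { get; } =
        new AlwaysRetryOnConnectionException();

    private AlwaysRetryOnConnectionException() { }
}

// The factory method shown above would then become:
//   public ICommandRetryPolicy AlwaysRetryOnConnectionException()
//       => AlwaysRetryOnConnectionException.Instance;
```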

{
    var task = Task.Run(ProcessRetryQueueAsync);
    if (task.IsFaulted)
        throw task.Exception;
Contributor

throw task.Exception - Won't that overwrite the exception stack?

Contributor

I think I saw a good helper method used for this scenario elsewhere, called 'ObserveException' or something like that.
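
For reference, a sketch of both options (the method names are assumptions, not necessarily the helper being remembered):

```csharp
using System.Runtime.ExceptionServices;
using System.Threading.Tasks;

internal static class RetryTaskExtensions
{
    // Option 1: rethrow a faulted task's inner exception without losing its
    // original stack trace.
    public static void RethrowIfFaulted(this Task task)
    {
        if (task.IsFaulted)
        {
            var ex = task.Exception;
            ExceptionDispatchInfo.Capture(ex.InnerException ?? ex).Throw();
        }
    }

    // Option 2: observe the fault so a fire-and-forget task never surfaces an
    // unobserved task exception, e.g. Task.Run(ProcessRetryQueueAsync).ObserveException();
    public static Task ObserveException(this Task task)
    {
        task.ContinueWith(t => { _ = t.Exception; }, TaskContinuationOptions.OnlyOnFaulted);
        return task;
    }
}
```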

{
    lock (queue)
    {
        var now = Environment.TickCount;
Contributor

I don't see 'now' used anywhere, is it just for debugging?

Collaborator

It's indeed unused, but the current implementation could end up accessing Environment.TickCount n times which isn't awesome - think we should refactor that a bit.
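
A sketch of that refactor (the delegate parameters stand in for the messageRetryHelper calls; the point is just to read Environment.TickCount once per pass):

```csharp
using System;
using System.Collections.Generic;

internal sealed class RetryTimeoutSweepSketch
{
    private readonly Queue<object> queue = new Queue<object>();

    // Hypothetical sketch: capture TickCount once and reuse it for every
    // message in this pass, assuming the queue is roughly ordered by enqueue time.
    internal void SweepTimedOutMessages(Func<object, int, bool> hasTimedOut,
                                        Action<object> failWithTimeout)
    {
        lock (queue)
        {
            int now = Environment.TickCount; // read once per pass
            while (queue.Count != 0)
            {
                var message = queue.Peek();
                if (!hasTimedOut(message, now)) break; // oldest not expired: stop
                queue.Dequeue();
                failWithTimeout(message);
            }
        }
    }
}
```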


try
{
    if (messageRetryHelper.HasTimedOut(message))
Contributor

Shouldn't we check messages for timeout before we break from the loop? Otherwise timed-out messages would linger in (and block?) the queue for a long time - i.e. we stop dequeueing until their endpoint is available, when instead we could have timed them out and unblocked the rest of the queue.

    if (!messageRetryHelper.IsEndpointAvailable(message))
    {
        break;
    }
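
To make the suggested ordering concrete, a sketch (the delegates stand in for the messageRetryHelper calls; this is not the PR's actual loop):

```csharp
using System;
using System.Collections.Generic;

internal static class RetryDrainOrderSketch
{
    // Hypothetical sketch: expire stale messages first, and only stop draining
    // when a non-expired message's endpoint is still unavailable, so a dead
    // endpoint cannot keep already-timed-out messages stuck in the queue.
    internal static void Drain(Queue<object> queue,
                               Func<object, bool> hasTimedOut,
                               Func<object, bool> isEndpointAvailable,
                               Action<object> failWithTimeout,
                               Action<object> retryMessage)
    {
        lock (queue)
        {
            while (queue.Count != 0)
            {
                var message = queue.Peek();
                if (hasTimedOut(message))
                {
                    queue.Dequeue();
                    failWithTimeout(message); // expired: complete it and move on
                    continue;
                }
                if (!isEndpointAvailable(message))
                {
                    break; // endpoint still down: leave the rest queued
                }
                queue.Dequeue();
                retryMessage(message);
            }
        }
    }
}
```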

{
    while (queue.Count != 0)
    {
        message = queue.Dequeue();
Contributor

Optional, could change to TryDequeue.

Collaborator

Same < netstandard2.0 issue here :(
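
One possible workaround for the older targets, sketched (the Compat suffix is made up to avoid clashing with the real TryDequeue/TryPeek on newer frameworks):

```csharp
using System.Collections.Generic;

// Hypothetical sketch: Queue<T>.TryDequeue/TryPeek are not available on all of
// the library's targets, but the same shape is easy to provide as extensions.
internal static class QueueCompatExtensions
{
    public static bool TryDequeueCompat<T>(this Queue<T> queue, out T value)
    {
        if (queue.Count != 0)
        {
            value = queue.Dequeue();
            return true;
        }
        value = default(T);
        return false;
    }

    public static bool TryPeekCompat<T>(this Queue<T> queue, out T value)
    {
        if (queue.Count != 0)
        {
            value = queue.Peek();
            return true;
        }
        value = default(T);
        return false;
    }
}
```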

lock (queue)
{
    var now = Environment.TickCount;
    while (queue.Count != 0)
Contributor

Optional, could change to while (queue.TryPeek)

bool createWorker = !_backlog.IsEmpty;
if (createWorker) StartBacklogProcessor();

Multiplexer.RetryQueueManager.StartRetryQueueProcessor();
Contributor

Are retries always in a race with the current backlog to be dequeued and written? This could cause confusing message reordering.

@TimLovellSmith (Contributor) commented Aug 13, 2021

@deepakverma Apparently I fail at using GitHub - I had some old comments which were sitting around in a 'review' that I had not 'published'. Were you able to see them before now? How about now? Sorry for the ones which are outdated.

@NickCraver (Collaborator)

I have a PR against this for the simplifications we talked about at deepakverma#5, but it's against a fork repo so it's a bit hard to see... and we can't change the source branch for this PR, while opening a new one loses all the discussion and great points & suggestions above. I'm not sure of the best path here since @deepakverma's out this week - @TimLovellSmith, should we work on those changes against my amendment branch?

I want to talk through some items, like not being in love with the Func<ConnectionMultiplexer, CommandRetryPolicy> - maybe we can come up with something better there, but hopefully I've illustrated why we need to decouple the instance itself from the config options (which can be re-used). We could, for example, make another interface that... basically has a single method that is the func, or something else. Let's definitely talk it through on Tuesday, but I'm trying to get as many suggestions in as possible ahead of a full call :)
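
One shape that idea could take, sketched (everything except ConnectionMultiplexer is a placeholder name, not this PR's actual types):

```csharp
using StackExchange.Redis;

// Hypothetical sketch: the factory is the reusable configuration piece, and the
// policy it returns is bound to a single multiplexer instance.
public abstract class CommandRetryPolicy
{
    protected CommandRetryPolicy(ConnectionMultiplexer multiplexer) { }
}

public interface ICommandRetryPolicyFactory
{
    // Single method standing in for Func<ConnectionMultiplexer, CommandRetryPolicy>.
    CommandRetryPolicy Create(ConnectionMultiplexer multiplexer);
}
```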

@NickCraver (Collaborator)

Collapsing into #1856 to get us on a branch in this repo to collab on as discussed on call!

@NickCraver NickCraver closed this Sep 4, 2021

Successfully merging this pull request may close these issues: CommandRetry onConnection Failure