Skip to content

Conversation

@NickCraver
Copy link
Collaborator

@NickCraver NickCraver commented Jun 19, 2021

When this was refactored long ago, we went from a counter starting at 1 (which handled the first run) to a string reason (which defaults to empty). Combined with the || clause format, this subtly introduced a re-entrancy bug where a subsequent run can enter during the first run (where first was true, so no exchange for reason occurred).

This was causing several Sentinel issues due to on connect handlers setting config as nodes returned and ultimately the Set<T> triggering a ReconfigureIfNeeded....and actually running, while the original config was still in progress. This led to all sorts of race oddness especially if one endpoint was any kind of significantly different in timing than another.

Note: the other changes are only labels on the bool arguments for clarity at the call sites.

When this was refactored long ago, we went from a counter starting at 1 (which handled the first run) to a string reason (which defaults to empty). Combined with the || clause format, this subtly introduced a re-entrancy bug where a subsequent run can enter during the first run (where first was true, so no exchange for reason occurred).

This was causing several Sentinel issues due to on connect handlers setting config as nodes returned and ultimately the Set<T> triggering a ReconfigureIfNeeded....and actually running, while the original config was still in progress. This led to all sorts of race oddness especially if one endpoint was any kind of significantly different in timing than another.
@NickCraver NickCraver marked this pull request as ready for review June 19, 2021 17:51
@NickCraver
Copy link
Collaborator Author

cc @TimLovellSmith on this fix - not the only Sentinel issue but it was one of the NoConnectionAvailable sources, because a re-entrancy race improperly set both nodes as master yielding an unselectable replica in the DemandReplica path. In reality, this probably caused far more production connection problems.

@NickCraver NickCraver merged commit a767cae into main Jun 19, 2021
@NickCraver NickCraver deleted the craver/reconfigure-fix-for-reentrancy branch June 19, 2021 17:55
NickCraver added a commit that referenced this pull request Jun 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants