-
Notifications
You must be signed in to change notification settings - Fork 1.3k
CURATOR-653: fix potential double leader for LeaderLatch #398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CURATOR-653: fix potential double leader for LeaderLatch #398
Conversation
|
@woaishixiaoxiao thanks for sharing your fix, |
OK. I will try it. |
e292454 to
275bbea
Compare
275bbea to
7272b85
Compare
HI i have added a unit test. please approval thanks |
|
LGTM. I also think of this change days before. cc @eolivelli @Randgalt can you also give a review? |
|
@woaishixiaoxiao can you create a JIRA ticket on https://issues.apache.org/jira/projects/CURATOR for this patch? |
tisonkun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think after #430 merged, RECONNECT no more causes reset. But it's reasonable to set leadership to false if getChildren finds itself not the leader.
|
Well. After #430 merged the test added in this patch failed. Need a closer look. |
Signed-off-by: tison <[email protected]>
|
I adjust the test to inject force |
eolivelli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
@tisonkun thanks for fixing the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good. I went over the test and it does what it should do. There are only two things that require an actual change (typo in the variable and catch clause in the test implementation hiding any Exception).
About the other nit-picky comments: I'm aware that I'm nit picking here. Hence, see them as proposal rather than requests to change. :-) Nice proposal 👍
Additionally, besides reasoning the code I verified that the test fails with the fix being reverted and succeeds with the fix being included.
| volatile CountDownLatch debugResetWaitLatch = null; | ||
|
|
||
| @VisibleForTesting | ||
| volatile CountDownLatch debugRestWaitBeforeNodeDelete = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| volatile CountDownLatch debugRestWaitBeforeNodeDelete = null; | |
| volatile CountDownLatch debugResetWaitBeforeNodeDeleteLatch = null; |
There's a typo in the name. Additionally, we might want to add Latch at the end to reflect the purpose of this member analogously to the other latches.
| assertEquals(states.poll(timing.forWaiting().milliseconds(), TimeUnit.MILLISECONDS), "true"); | ||
| } | ||
| } | ||
| catch (Exception e){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we hiding thrown exceptions here? Shouldn't we expose it as part of the test run if something went wrong? 🤔 The test would succeed if the Exception is thrown in this block and caught here.
| latches.add(latch); | ||
| assertEquals(states.poll(timing.forWaiting().milliseconds(), TimeUnit.MILLISECONDS), ConnectionState.CONNECTED.name()); | ||
| if (i == 0) { | ||
| assertEquals(states.poll(timing.forWaiting().milliseconds(), TimeUnit.MILLISECONDS), "true"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assertEquals(states.poll(timing.forWaiting().milliseconds(), TimeUnit.MILLISECONDS), "true"); | |
| assertEquals("The first LeaderLatch instance should acquire leadership.", states.poll(timing.forWaiting().milliseconds(), TimeUnit.MILLISECONDS), "true"); |
nit: maybe adding a bit more context to this polling here to describe the test case
| } | ||
| } | ||
| timing.forWaiting().sleepABit(); | ||
| // now latch1 is leader, latch2 is not leader. latch2 listens to the ephemeral node created by latch1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Moving comments into assert messages improves the test output and still works as some kind of comment. This comment could be added to the assertTrue and assertFalse in line 284-285 below describing the currently expected state.
|
|
||
| // force latch1 and latch2 reset | ||
| latch1.reset(); | ||
| ForkJoinPool.commonPool().submit(() -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a comment here on why we're calling latch2.reset() in a separate thread? AFAIU, it's done to not make the test's thread block due to latch2.debugRestWaitBeforeNodeDelete. It might help readers if this is reflected in a comment here. WDYT?
| public void stateChanged(CuratorFramework client, ConnectionState newState) | ||
| { | ||
| if (newState == ConnectionState.CONNECTED) { | ||
| states.add(newState.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Adding the LeaderLatch ID (we could use for loop i) here might help further down in the test understanding the state of the queue.
| LeaderLatchListener listener = new LeaderLatchListener() { | ||
| @Override | ||
| public void isLeader() { | ||
| states.add("true"); | ||
| } | ||
|
|
||
| @Override | ||
| public void notLeader() { | ||
| states.add("false"); | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the LeaderLatch ID in the event labels (as mentioned above) might help when evaluating the queue later on in the test. But to be fair: doing the asserts on hasLeadership like it's already done below (lines 303-304) serves the same purpose. I just mention it as another idea here. 🤷
| LeaderLatch latch1 = latches.get(0); | ||
| LeaderLatch latch2 = latches.get(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| LeaderLatch latch1 = latches.get(0); | |
| LeaderLatch latch2 = latches.get(1); | |
| LeaderLatch initialLeaderLatch = latches.get(0); | |
| LeaderLatch initialNonLeaderLatch = latches.get(1); |
nit: maybe, making the variable more descriptive to avoid confusion. Especially because we're switching the order here in comparison to what is described in the PR description and the corresponding ticket.
| return; | ||
| } | ||
| } | ||
| timing.forWaiting().sleepABit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of waiting here? 🤔
|
@XComp Thank you very much for the comments. Could you send a pull request based on the current patch? I don't know whether I can directly merge on the forked repository but at least I can submit the patch when you made it :) |
I created PR #436 but couldn't base it onto this PR. I handled my review comments individually to make it easier to select relevant changes. The last commit is a bigger refactoring of the test. You might want to leave that out if you think that it's too much of a change. |
Actually, you can send a pull request with base branch = https://github.com/woaishixiaoxiao/curator/tree/fixbug/LeaderLatch-double-master and head branch = your branch :) Then the pull request will be opened against https://github.com/woaishixiaoxiao/curator and that should be fine. We use this pull request. Or it's also OK since the author is inactive, you take over the whole lifecycle. |
|
It was a bit trickier but I managed to base the PR based on his branch (his branch didn't show up when I was doing it the way I usually create PRs ¯_(ツ)_/¯). woaishixiaoxiao#1. Although, I'm not sure whether that's of any help because the original author would have to merge it into his branch, wouldn't he? |
Co-authored-by: shixiaoxiao <[email protected]> Co-authored-by: tison <[email protected]>
When I use the LeaderLatch to select leader, there is a double-leader phenomenon.
The timeline is as follows: