Skip to content

Fix race in TestApiSwarmRestartCluster#25077

Merged
crosbymichael merged 1 commit intomoby:masterfrom
cpuguy83:fix_TestApiSwarmRestartCluster
Jul 28, 2016
Merged

Fix race in TestApiSwarmRestartCluster#25077
crosbymichael merged 1 commit intomoby:masterfrom
cpuguy83:fix_TestApiSwarmRestartCluster

Conversation

@cpuguy83
Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 commented Jul 26, 2016

In TestApiSwarmRestartCluster, it's calling checkClusterHealth.
checkClusterHealth calls d.info(), which will return an error if
there is no cluster leader... problem is checkClusterHealth is doing a
nil error assertion w/o giving any time for a leader to be elected.

This moves the d.info() call into a waitAndAssert using the default
reconciliation timeout.

Fixes #24590

@cpuguy83
Copy link
Copy Markdown
Member Author

Found a couple of more issues with this test, working on an update.

In `TestApiSwarmRestartCluster`, it's calling `checkClusterHealth`.
`checkClusterHealth` calls `d.info()`, which will return an error if
there is no cluster leader... problem is `checkClusterHealth` is doing a
nil error assertion w/o giving any time for a leader to be elected.

This moves the `d.info()` call into a `waitAndAssert` using the default
reconciliation timeout.

It also moves some other checks into a `waitAndAssert` to give the
cluster enough time to come back up.

Signed-off-by: Brian Goff <[email protected]>
return true, nil
}
nn := d.getNode(c, n.ID)
n = *nn
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this. Why is state being checked before getting the node?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we already have the node, might as well check if its already ok before getting the node again,

@dnephin
Copy link
Copy Markdown
Member

dnephin commented Jul 27, 2016

LGTM

1 similar comment
@crosbymichael
Copy link
Copy Markdown
Contributor

LGTM

@tonistiigi
Copy link
Copy Markdown
Member

Starting a test daemon blocks until the control API becomes available, this should not happen before leader has been elected. If there is a reelection(not sure why it should be in this case) but there is no error the manager calls should block also. Debugging why reelection happened in this test already pointed us to at least one swarmkit bug in moby/swarmkit#1183 .

If we don't do this every user of swarm API endpoints needs to write:

for {
  if err := apicall(); err != nil {
    if isNoLeader(err) {
      backoffDelay()
      continue
    }
    return err
  }
}

That doesn't seem to be very user-friendly.

I'm fine with patching the test for CI but we need to make sure we are not hiding actual bugs by working around them in our tests.

@cpuguy83 cpuguy83 deleted the fix_TestApiSwarmRestartCluster branch August 1, 2016 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants