Flaky test: DockerSwarmSuite.TestSwarmLockUnlockCluster #34051

Open
thaJeztah opened this issue Jul 11, 2017 · 9 comments
Labels: area/testing, status/needs-attention (Calls for a collective discussion during a review session)

Comments

@thaJeztah (Member)

Seen failing on

Note that this test was flaky before (#28997), but recently became an issue again

ping @cyli @aaronlehmann PTAL

22:36:13 ----------------------------------------------------------------------
22:36:13 FAIL: docker_cli_swarm_test.go:1146: DockerSwarmSuite.TestSwarmLockUnlockCluster
22:36:13 
22:36:13 [d6dd4af73d088] waiting for daemon to start
22:36:13 [d6dd4af73d088] daemon started
22:36:13 
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 [d36d02feae8c0] waiting for daemon to start
22:36:13 [d36d02feae8c0] daemon started
22:36:13 
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d6dd4af73d088] exiting daemon
22:36:13 [d6dd4af73d088] waiting for daemon to start
22:36:13 [d6dd4af73d088] daemon started
22:36:13 
22:36:13 [d36d02feae8c0] exiting daemon
22:36:13 [d36d02feae8c0] waiting for daemon to start
22:36:13 [d36d02feae8c0] daemon started
22:36:13 
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 docker_cli_swarm_test.go:1192:
22:36:13     // d2 is now set to lock
22:36:13     checkSwarmUnlockedToLocked(c, d2)
22:36:13 docker_utils_test.go:471:
22:36:13     c.Assert(v, checker, args...)
22:36:13 ... obtained bool = false
22:36:13 ... expected bool = true
22:36:13 
22:36:13 [d6dd4af73d088] exiting daemon
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d36d02feae8c0] exiting daemon
22:36:27 
22:36:27 ----------------------------------------------------------------------
thaJeztah added the area/testing and status/needs-attention labels on Jul 11, 2017
@thaJeztah (Member, Author)

/cc @tophj-ibm

@aaronlehmann (Contributor) commented Jul 11, 2017

We think that moby/swarmkit#2281 fixes this, but it hasn't been vendored yet.

@cyli (Contributor) commented Jul 11, 2017

Opened #34061 to vendor that change and others into master.

aaronlehmann removed the status/needs-attention label on Jul 11, 2017
@tophj-ibm (Contributor)

NOTE: this is now failing with a "pending" node state, because the node is not brought up in time.

A lot of swarm tests have been failing recently (also #34993) from what appears to be the same issue: nodes restarting and getting out of sync. From experimental: https://jenkins.dockerproject.org/job/Docker-PRs-experimental/37592/consoleFull

When this node restarts, it's ahead by 1 in the term cycle, so it ignores all MsgVote messages. It then starts its own elections and keeps getting rejected, resulting in the "The swarm does not have a leader" error message, and eventually "swarm component could not be started before timeout was reached".

time="2017-10-31T13:08:04.690924203Z" level=info msg="6d8fa764b51bacd6 [term: 12] ignored a MsgVote message with lower term from 41d64a61a0619af2 [term: 11]" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
...
time="2017-10-31T13:08:13.690202658Z" level=info msg="6d8fa764b51bacd6 received MsgVoteResp rejection from 41d64a61a0619af2 at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690229978Z" level=info msg="6d8fa764b51bacd6 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690261259Z" level=info msg="6d8fa764b51bacd6 received MsgVoteResp rejection from 2110dc9bf48a6bda at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690278462Z" level=info msg="6d8fa764b51bacd6 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690290003Z" level=info msg="6d8fa764b51bacd6 became follower at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.936895212Z" level=debug msg="(*session).start" module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.937371357Z" level=error msg="agent: session failed" error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.937411632Z" level=debug msg="agent: rebuild session" module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:16.670454397Z" level=error msg="swarm component could not be started before timeout was reached" 

I could be wrong, but I don't think this is how this is supposed to work.
ping @aaronlehmann @nishanttotla (?)

@thaJeztah (Member, Author)

This one is failing a lot recently

@cpuguy83 (Member)

time="2019-07-12T23:58:04.705869801Z" level=info msg="Daemon has completed initialization"
time="2019-07-12T23:58:04.705884211Z" level=info msg="initialized VXLAN UDP port to 4789 "
time="2019-07-12T23:58:04.705873378Z" level=debug msg="memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706003308Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706029518Z" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=127.0.0.1 Adv-addr=127.0.0.1 Data-addr= Remote-addr-list=[127.0.0.1 127.0.0.1] MTU=1500"
time="2019-07-12T23:58:04.706052555Z" level=debug msg="Control plane MTU: 1500 will initialize NetworkDB with: 1400"
time="2019-07-12T23:58:04.706083859Z" level=info msg="New memberlist node - Node:89319d3ab32e will use memberlist nodeID:f11e209984b7 with config:&{NodeID:f11e209984b7 Hostname:89319d3ab32e BindAddr:0.0.0.0 AdvertiseAddr:127.0.0.1 BindPort:0 Keys:[[199 253 49 128 178 35 174 226 7 251 15 55 61 36 91 79] [178 175 153 34 163 254 58 108 45 66 180 9 136 236 101 36] [215 243 221 106 15 9 234 156 198 125 25 53 66 234 107 164]] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2019-07-12T23:58:04.706104542Z" level=debug msg="Encryption key 1: c7fd3"
time="2019-07-12T23:58:04.706113403Z" level=debug msg="Encryption key 2: b2af9"
time="2019-07-12T23:58:04.706121591Z" level=debug msg="Encryption key 3: d7f3d"
time="2019-07-12T23:58:04.706179614Z" level=debug msg="memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706204887Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"

@arkodg I assume we need to set some flags on our test daemons to make sure they don't try to bind on the same addr here.

@thaJeztah (Member, Author)

@cpuguy83 there's a portIndex to prevent that:

func (s *DockerSwarmSuite) AddDaemon(c *check.C, joinSwarm, manager bool) *daemon.Daemon {
    d := daemon.New(c, dockerBinary, dockerdBinary,
        testdaemon.WithEnvironment(testEnv.Execution),
        testdaemon.WithSwarmPort(defaultSwarmPort+s.portIndex),

But it looks like it isn't protected, so we might need a lock:

s.portIndex++
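
For illustration, a minimal sketch of what guarding that increment could look like, assuming the suite grows a mutex around portIndex; the portIndexMu field and nextSwarmPort helper are hypothetical names, not necessarily the change made in #39516:

package swarmsuite // sketch only; the real code lives in the integration-cli test package

import "sync"

const defaultSwarmPort = 2477 // same base port seen in the logs below (0.0.0.0:2477)

// DockerSwarmSuite here is a stripped-down stand-in for the real suite type.
type DockerSwarmSuite struct {
    portIndex   int        // existing counter used by AddDaemon
    portIndexMu sync.Mutex // hypothetical: guards portIndex
}

// nextSwarmPort hands out a unique swarm port even when AddDaemon is called concurrently.
func (s *DockerSwarmSuite) nextSwarmPort() int {
    s.portIndexMu.Lock()
    defer s.portIndexMu.Unlock()
    port := defaultSwarmPort + s.portIndex
    s.portIndex++
    return port
}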

@thaJeztah (Member, Author)

perhaps this works: #39516

@andrewhsu (Member)

This is still flaky. Running the PR checks (janky) against the moby/moby master branch, I see this failing once out of the last 30 runs. From the console logs of tests run against tree https://github.com/moby/moby/tree/7449ca3:

FAIL: docker_cli_swarm_test.go:1133: DockerSwarmSuite.TestSwarmLockUnlockCluster

Creating a new daemon
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

Creating a new daemon
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

[dce99e328ee48] joining swarm manager [d8a045ebf7f1c]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
Creating a new daemon
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] daemon started

[da9526a3b2863] joining swarm manager [d8a045ebf7f1c]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
waited for 84.938µs (out of 30s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

waited for 11.454811ms (out of 1s)
waited for 96.405µs (out of 30s)
[da9526a3b2863] Stopping daemon
[da9526a3b2863] exiting daemon
[da9526a3b2863] Daemon stopped
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] daemon started

waited for 7.277254ms (out of 1s)
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

waited for 64.373µs (out of 30s)
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

waited for 11.7301ms (out of 1s)
waited for 100.250811ms (out of 30s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

docker_cli_swarm_test.go:1173:
    checkSwarmLockedToUnlocked(c, d)
docker_utils_test.go:435:
    c.Assert(v, checker, args...)
... obtained swarm.LocalNodeState = "pending"
... expected swarm.LocalNodeState = "active"

waited for 1.108131523s (out of 1s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[da9526a3b2863] Stopping daemon
[da9526a3b2863] exiting daemon
[da9526a3b2863] Daemon stopped
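
For reference, the swarm.LocalNodeState in that assertion ("pending" vs "active") is what the daemon reports via its /info endpoint. Below is a minimal sketch of the same kind of poll against a live daemon using the Docker API client; pollNodeState is a hypothetical helper, not the test's checkSwarmLockedToUnlocked:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/docker/docker/api/types/swarm"
    "github.com/docker/docker/client"
)

// pollNodeState waits until the daemon reports the wanted local node state
// (e.g. "active" after an unlock) or the timeout expires.
func pollNodeState(ctx context.Context, cli *client.Client, want swarm.LocalNodeState, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        info, err := cli.Info(ctx)
        if err == nil && info.Swarm.LocalNodeState == want {
            return nil
        }
        if time.Now().After(deadline) {
            if err != nil {
                return err
            }
            return fmt.Errorf("node state is %q, wanted %q", info.Swarm.LocalNodeState, want)
        }
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }
    // In the failure above the node is still "pending" when the test's poll
    // deadline hits, instead of reaching "active".
    if err := pollNodeState(context.Background(), cli, swarm.LocalNodeStateActive, 30*time.Second); err != nil {
        fmt.Println(err)
    }
}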
