Flaky test: DockerSwarmSuite.TestSwarmLockUnlockCluster #34051

Open
thaJeztah opened this issue Jul 11, 2017 · 9 comments
Labels: area/testing, status/needs-attention (Calls for a collective discussion during a review session)

Comments

@thaJeztah (Member)

Seen failing on

Note that this test was flaky before (#28997), but recently became an issue again

ping @cyli @aaronlehmann PTAL

22:36:13 ----------------------------------------------------------------------
22:36:13 FAIL: docker_cli_swarm_test.go:1146: DockerSwarmSuite.TestSwarmLockUnlockCluster
22:36:13 
22:36:13 [d6dd4af73d088] waiting for daemon to start
22:36:13 [d6dd4af73d088] daemon started
22:36:13 
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 [d36d02feae8c0] waiting for daemon to start
22:36:13 [d36d02feae8c0] daemon started
22:36:13 
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d6dd4af73d088] exiting daemon
22:36:13 [d6dd4af73d088] waiting for daemon to start
22:36:13 [d6dd4af73d088] daemon started
22:36:13 
22:36:13 [d36d02feae8c0] exiting daemon
22:36:13 [d36d02feae8c0] waiting for daemon to start
22:36:13 [d36d02feae8c0] daemon started
22:36:13 
22:36:13 [d072353b4c428] waiting for daemon to start
22:36:13 [d072353b4c428] daemon started
22:36:13 
22:36:13 docker_cli_swarm_test.go:1192:
22:36:13     // d2 is now set to lock
22:36:13     checkSwarmUnlockedToLocked(c, d2)
22:36:13 docker_utils_test.go:471:
22:36:13     c.Assert(v, checker, args...)
22:36:13 ... obtained bool = false
22:36:13 ... expected bool = true
22:36:13 
22:36:13 [d6dd4af73d088] exiting daemon
22:36:13 [d072353b4c428] exiting daemon
22:36:13 [d36d02feae8c0] exiting daemon
22:36:27 
22:36:27 ----------------------------------------------------------------------
thaJeztah added the area/testing and status/needs-attention labels on Jul 11, 2017
@thaJeztah (Member, Author)

/cc @tophj-ibm

@aaronlehmann (Contributor) commented Jul 11, 2017

We think that moby/swarmkit#2281 fixes this, but it hasn't been vendored yet.

@cyli (Contributor) commented Jul 11, 2017

Opened #34061 to vendor that change and others into master.

aaronlehmann removed the status/needs-attention label on Jul 11, 2017
@tophj-ibm (Contributor)

NOTE: this is now failing with a "pending" node state, because the node is not brought up in time.

A lot of swarm tests have been failing recently (also #34993) from what appears to be the same issue: nodes restarting and getting out of sync. From experimental: https://jenkins.dockerproject.org/job/Docker-PRs-experimental/37592/consoleFull

When this node restarts, it's ahead by 1 in the term cycle, so it ignores all MsgVote messages. It then starts its own elections and keeps getting rejected, resulting in the "The swarm does not have a leader" error message, and eventually "swarm component could not be started before timeout was reached".

time="2017-10-31T13:08:04.690924203Z" level=info msg="6d8fa764b51bacd6 [term: 12] ignored a MsgVote message with lower term from 41d64a61a0619af2 [term: 11]" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
...
time="2017-10-31T13:08:13.690202658Z" level=info msg="6d8fa764b51bacd6 received MsgVoteResp rejection from 41d64a61a0619af2 at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690229978Z" level=info msg="6d8fa764b51bacd6 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690261259Z" level=info msg="6d8fa764b51bacd6 received MsgVoteResp rejection from 2110dc9bf48a6bda at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690278462Z" level=info msg="6d8fa764b51bacd6 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.690290003Z" level=info msg="6d8fa764b51bacd6 became follower at term 15" module=raft node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.936895212Z" level=debug msg="(*session).start" module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.937371357Z" level=error msg="agent: session failed" error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:13.937411632Z" level=debug msg="agent: rebuild session" module=node/agent node.id=wculof7c9k6q9hz3ttqgmz3ue
time="2017-10-31T13:08:16.670454397Z" level=error msg="swarm component could not be started before timeout was reached" 

I could be wrong, but I don't think this is how this is supposed to work.
ping @aaronlehmann @nishanttotla (?)

@thaJeztah (Member, Author)

This one is failing a lot recently

@cpuguy83 (Member)

time="2019-07-12T23:58:04.705869801Z" level=info msg="Daemon has completed initialization"
time="2019-07-12T23:58:04.705884211Z" level=info msg="initialized VXLAN UDP port to 4789 "
time="2019-07-12T23:58:04.705873378Z" level=debug msg="memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706003308Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706029518Z" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=127.0.0.1 Adv-addr=127.0.0.1 Data-addr= Remote-addr-list=[127.0.0.1 127.0.0.1] MTU=1500"
time="2019-07-12T23:58:04.706052555Z" level=debug msg="Control plane MTU: 1500 will initialize NetworkDB with: 1400"
time="2019-07-12T23:58:04.706083859Z" level=info msg="New memberlist node - Node:89319d3ab32e will use memberlist nodeID:f11e209984b7 with config:&{NodeID:f11e209984b7 Hostname:89319d3ab32e BindAddr:0.0.0.0 AdvertiseAddr:127.0.0.1 BindPort:0 Keys:[[199 253 49 128 178 35 174 226 7 251 15 55 61 36 91 79] [178 175 153 34 163 254 58 108 45 66 180 9 136 236 101 36] [215 243 221 106 15 9 234 156 198 125 25 53 66 234 107 164]] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2019-07-12T23:58:04.706104542Z" level=debug msg="Encryption key 1: c7fd3"
time="2019-07-12T23:58:04.706113403Z" level=debug msg="Encryption key 2: b2af9"
time="2019-07-12T23:58:04.706121591Z" level=debug msg="Encryption key 3: d7f3d"
time="2019-07-12T23:58:04.706179614Z" level=debug msg="memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-12T23:58:04.706204887Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"

@arkodg I assume we need to set some flags on our test daemons to make sure they don't try to bind on the same addr here.

@thaJeztah (Member, Author)

@cpuguy83 there's a portIndex to prevent that:

func (s *DockerSwarmSuite) AddDaemon(c *check.C, joinSwarm, manager bool) *daemon.Daemon {
    d := daemon.New(c, dockerBinary, dockerdBinary,
        testdaemon.WithEnvironment(testEnv.Execution),
        testdaemon.WithSwarmPort(defaultSwarmPort+s.portIndex),

But it looks like it isn't protected, so we might need a lock:

s.portIndex++
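
For illustration, a minimal sketch of what guarding that increment could look like, assuming the suite grows a mutex around portIndex; the portIndexMu field and nextSwarmPort helper are hypothetical names, not necessarily the change made in #39516:

package swarmsuite // sketch only; the real code lives in the integration-cli test package

import "sync"

const defaultSwarmPort = 2477 // same base port seen in the logs below (0.0.0.0:2477)

// DockerSwarmSuite here is a stripped-down stand-in for the real suite type.
type DockerSwarmSuite struct {
    portIndex   int        // existing counter used by AddDaemon
    portIndexMu sync.Mutex // hypothetical: guards portIndex
}

// nextSwarmPort hands out a unique swarm port even when AddDaemon is called concurrently.
func (s *DockerSwarmSuite) nextSwarmPort() int {
    s.portIndexMu.Lock()
    defer s.portIndexMu.Unlock()
    port := defaultSwarmPort + s.portIndex
    s.portIndex++
    return port
}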

@thaJeztah (Member, Author)

perhaps this works: #39516

@andrewhsu (Member)

This is still flaky. Running the PR checks (janky) against the moby/moby master branch, I see this failing once out of the last 30 runs. From the console logs of tests run against tree https://github.com/moby/moby/tree/7449ca3:

FAIL: docker_cli_swarm_test.go:1133: DockerSwarmSuite.TestSwarmLockUnlockCluster

Creating a new daemon
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

Creating a new daemon
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

[dce99e328ee48] joining swarm manager [d8a045ebf7f1c]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
Creating a new daemon
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] daemon started

[da9526a3b2863] joining swarm manager [d8a045ebf7f1c]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
waited for 84.938µs (out of 30s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

waited for 11.454811ms (out of 1s)
waited for 96.405µs (out of 30s)
[da9526a3b2863] Stopping daemon
[da9526a3b2863] exiting daemon
[da9526a3b2863] Daemon stopped
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] waiting for daemon to start
[da9526a3b2863] daemon started

waited for 7.277254ms (out of 1s)
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

waited for 64.373µs (out of 30s)
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] waiting for daemon to start
[dce99e328ee48] daemon started

waited for 11.7301ms (out of 1s)
waited for 100.250811ms (out of 30s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] error pinging daemon on start: Get http://d8a045ebf7f1c.sock/_ping: context deadline exceeded
[d8a045ebf7f1c] waiting for daemon to start
[d8a045ebf7f1c] daemon started

docker_cli_swarm_test.go:1173:
    checkSwarmLockedToUnlocked(c, d)
docker_utils_test.go:435:
    c.Assert(v, checker, args...)
... obtained swarm.LocalNodeState = "pending"
... expected swarm.LocalNodeState = "active"

waited for 1.108131523s (out of 1s)
[d8a045ebf7f1c] Stopping daemon
[d8a045ebf7f1c] exiting daemon
[d8a045ebf7f1c] Daemon stopped
[dce99e328ee48] Stopping daemon
[dce99e328ee48] exiting daemon
[dce99e328ee48] Daemon stopped
[da9526a3b2863] Stopping daemon
[da9526a3b2863] exiting daemon
[da9526a3b2863] Daemon stopped
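
For reference, the swarm.LocalNodeState in that assertion ("pending" vs "active") is what the daemon reports via its /info endpoint. Below is a minimal sketch of the same kind of poll against a live daemon using the Docker API client; pollNodeState is a hypothetical helper, not the test's checkSwarmLockedToUnlocked:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/docker/docker/api/types/swarm"
    "github.com/docker/docker/client"
)

// pollNodeState waits until the daemon reports the wanted local node state
// (e.g. "active" after an unlock) or the timeout expires.
func pollNodeState(ctx context.Context, cli *client.Client, want swarm.LocalNodeState, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        info, err := cli.Info(ctx)
        if err == nil && info.Swarm.LocalNodeState == want {
            return nil
        }
        if time.Now().After(deadline) {
            if err != nil {
                return err
            }
            return fmt.Errorf("node state is %q, wanted %q", info.Swarm.LocalNodeState, want)
        }
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }
    // In the failure above the node is still "pending" when the test's poll
    // deadline hits, instead of reaching "active".
    if err := pollNodeState(context.Background(), cli, swarm.LocalNodeStateActive, 30*time.Second); err != nil {
        fmt.Println(err)
    }
}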
