[DO NOT MERGE] TestAPISwarmLeaderElection: debug by kolyshkin · Pull Request #37833 · moby/moby

kolyshkin · 2018-09-13T02:31:21Z

(summary of my findings based on debug obtained)

Here's the test workflow leading to a failure:

Start three swarm nodes (d1, d2, d3), sequentially.
Check that d1 is the leader.
Stop d1.
Wait until either d2 or d3 becomes the leader, by repeatedly calling /nodes/<ID> for both of them every 100ms until ManagerStatus.Leader is true for either one.

One of the calls to /nodes/<ID> times out after 20s, probably meaning it is stuck somewhere in (*Cluster).GetNode() (source file daemon/cluster/nodes.go).

Also note that cherry-picking moby/swarmkit#2744 didn't help.

kolyshkin · 2018-09-13T04:22:07Z

OK now we know

it fails on the first stage of the test, here:

moby/internal/test/daemon/node.go

Lines 25 to 26 in 03e089e

    
    node, _, err := cli.NodeInspectWithRaw(context.Background(), id)
    assert.NilError(t, err)

called from

moby/integration-cli/docker_api_swarm_test.go

Line 319 in 03e089e

if d.GetNode(c, d.NodeID()).ManagerStatus.Leader {

called from

moby/integration-cli/docker_api_swarm_test.go

Lines 334 to 335 in 03e089e

    
    // wait for an election to occur
    waitAndAssert(c, defaultReconciliationTimeout, checkLeader(d2, d3), checker.True)

increasing timeout won't help

an excerpt from the logs:

01:13:30.862 [d1b57b00693f0] exiting daemon
01:13:30.862 Waiting for election to occur...
01:13:30.862 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
01:13:30.862 waited for 20.001218586s (out of 30s)

codecov · 2018-09-13T04:26:20Z

Codecov Report

❗ No coverage uploaded for pull request base (master@c77cfbf).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #37833   +/-   ##
=========================================
  Coverage          ?   36.09%           
=========================================
  Files             ?      610           
  Lines             ?    45143           
  Branches          ?        0           
=========================================
  Hits              ?    16294           
  Misses            ?    26609           
  Partials          ?     2240

kolyshkin · 2018-09-13T22:07:20Z

rebase to re-CI 🐕

kolyshkin · 2018-09-13T23:49:16Z

Better debug (from https://jenkins.dockerproject.org/job/Docker-PRs-experimental/42089/console):

01:07:14.476 ----------------------------------------------------------------------
01:07:14.476 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:07:14.476
01:07:14.476 [d3a785812b3a5] waiting for daemon to start
01:07:14.476 [d3a785812b3a5] daemon started
01:07:14.476
01:07:14.476 [dc968556f08ad] waiting for daemon to start
01:07:14.476 [dc968556f08ad] daemon started
01:07:14.476
01:07:14.476 [d6e7fc83aa5d0] waiting for daemon to start
01:07:14.477 [d6e7fc83aa5d0] daemon started
01:07:14.477
01:07:14.477 [d3a785812b3a5] exiting daemon
01:07:14.477 Waiting for election to occur...
01:07:14.477 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [dc968556f08ad] (*Daemon).GetNode: NodeInspectWithRaw("o3kiirubf8077k4f5fyrka8f1") failed
01:07:14.477 waited for 20.00151239s (out of 30s)
01:07:14.477 [dc968556f08ad] exiting daemon
01:07:14.477 [d6e7fc83aa5d0] exiting daemon

kolyshkin · 2018-09-14T05:17:01Z

The second node can't connect to the first one, although the first one seems fine.

First node (which is the manager, d3a785812b3a5):

time="2018-09-13T23:13:30.126000753Z" level=debug msg="form data: {"AdvertiseAddr":"","AutoLockManagers":false,"Availability":"","DataPathAddr":"","DefaultAddrPool":null,"ForceNewCluster":false,"ListenAddr":"0.0.0.0:2477","Spec":{"CAConfig":{},"Dispatcher":{},"EncryptionConfig":{"AutoLockManagers":false},"Labels":null,"Orchestration":{},"Raft":{"ElectionTick":0,"HeartbeatTick":0},"TaskDefaults":{}},"SubnetSize":0}"
time="2018-09-13T23:13:30.156305024Z" level=info msg="Listening for connections" addr="[::]:2477" module=node node.id=latz8cgh5l4og5ybp9fml0gkz proto=tcp

Second node, which eventually fails:

time="2018-09-13T23:13:33.270277003Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {127.0.0.1:2477 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused". Reconnecting..." module=grpc
...

Still I don't possess enough swarm knowledge to figure out what is going on :(

kolyshkin · 2018-09-15T01:43:19Z

@dperny do you have any idea about why this might fail this way?

kolyshkin · 2018-09-15T02:27:05Z

The second node can't connect to the first one, although the first one seems fine.

This was all wrong, of course: we had stopped the first daemon, which is why the others can't connect to it, so this is not the source of the failure.

Can it happen because swarmRequestTimeout (20s) is less than maxReconnectDelay (30s)?

kolyshkin · 2018-09-18T07:39:36Z

The fix from moby/swarmkit#2744 didn't help; same issue:

01:22:15.929
01:22:15.929 ----------------------------------------------------------------------
01:22:15.930 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:22:15.930
01:22:15.930 [d5edde61dee1f] waiting for daemon to start
01:22:15.930 [d5edde61dee1f] daemon started
01:22:15.930
01:22:15.930 [dce5ddeb80d23] waiting for daemon to start
01:22:15.930 [dce5ddeb80d23] daemon started
01:22:15.930
01:22:15.930 [d2967d8955d2d] waiting for daemon to start
01:22:15.930 [d2967d8955d2d] daemon started
01:22:15.930
01:22:15.930 [d5edde61dee1f] exiting daemon
01:22:15.930 Waiting for election to occur...
01:22:15.931 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [dce5ddeb80d23] (*Daemon).GetNode: NodeInspectWithRaw("ywro1y488c8mry36jd6gdn4ve") failed
01:22:15.931 waited for 20.00124227s (out of 30s)
01:22:15.931 [dce5ddeb80d23] exiting daemon
01:22:15.931 [d2967d8955d2d] exiting daemon
01:22:40.245
01:22:40.245 ----------------------------------------------------------------------

dperny · 2018-09-21T16:56:42Z

Are you running this test and getting a failure locally? If so, how are you doing it? Also, is this flaky, or consistently failing?

kolyshkin · 2018-09-21T20:57:56Z

Are you running this test and getting a failure locally?

Nope, I was not able to repro this locally, only on Moby CI. So what I was doing is adding debug and waiting for the PR's CI to fail, then analysing the test logs and daemon logs (from the CI bundle tarball).

Also, is this flaky, or consistently failing?

It is flaky, but it feels like it is currently #1 among the most frequently failing tests. Off the top of my head, it fails on a PR with 50-80% probability.

The original issue about this test failing is here: #32673, and at the bottom of that page you can find some other PRs in which this test was failing.

kolyshkin · 2018-09-22T00:32:16Z

Off the top of my head, it fails on a PR with 50-80% probability.

What I meant here is that there is a 50-60% chance that a single CI run on any given PR will result in this test failing. A CI run consists of 4 Linux systems (two x64, s390 and power) and 1 Windows system.

dperny · 2018-10-01T19:04:01Z

i am coming back around to investigate this now. i'll let you know as soon as i have something.

kolyshkin · 2018-10-01T22:01:57Z

rebased to current master

kolyshkin · 2018-10-01T23:54:52Z

aaaand the test is passing now (from what I can see from the last 5-7 PRs, in all other PRs, too). Could it be because of some recent merges? Or CI infrastructure changes? Or moon phase?

In any case, it appears to not be failing recently. I'll keep looking at it.

dperny · 2018-10-02T17:35:33Z

classic heisenbug

kolyshkin · 2018-10-03T16:31:30Z

If I could repro it locally I'd do git bisect to figure out what exactly suddenly "fixed" it, alas it is only reproducible on Travis. I am going to keep triggering a few more CI runs, say up to 10, trying to repro.

kolyshkin · 2018-10-03T16:37:12Z

@dperny just saw this test fail on another PR's CI

#37958 CI running on power
Logs here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11623/console
Bundle here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11623/artifact/bundles.tar.gz

kolyshkin · 2018-10-04T17:18:47Z

@dperny got a failure on ppc:

01:27:00.837
01:27:00.837 ----------------------------------------------------------------------
01:27:00.837 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:27:00.837
01:27:00.837 [d8286f7d756c4] waiting for daemon to start
01:27:00.837 [d8286f7d756c4] daemon started
01:27:00.837
01:27:00.837 [d3f46a740a57e] waiting for daemon to start
01:27:00.837 [d3f46a740a57e] daemon started
01:27:00.837
01:27:00.837 [dc744066461da] waiting for daemon to start
01:27:00.838 [dc744066461da] daemon started
01:27:00.838
01:27:00.838 [d8286f7d756c4] exiting daemon
01:27:00.838 Waiting for election to occur...
01:27:00.838 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [d3f46a740a57e] (*Daemon).GetNode: NodeInspectWithRaw("c79apwb38k2lf8j0j205phjxh") failed
01:27:00.838 waited for 20.002914211s (out of 30s)
01:27:00.838 [d3f46a740a57e] exiting daemon
01:27:00.838 [dc744066461da] exiting daemon
01:27:17.867
01:27:17.867 ----------------------------------------------------------------------

Test logs here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11630/console
Bundle tarball (with daemon logs etc) here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11630/


kolyshkin · 2018-10-05T20:51:03Z

As requested by @dperny, made the test output verbose even when tests do not fail, and limited the tests to run to TestAPISwarm* in integration-cli.

1. Using MNT_FORCE flag does not make sense for nsfs. Using MNT_DETACH though might help.

2. When -check.vv is added to TESTFLAGS, there are a lot of messages like this one:

> unmount of /tmp/dxr/d847fd103a4ba/netns failed: invalid argument

and some like

> unmount of /tmp/dxr/dd245af642d94/netns failed: no such file or directory

The first one means directory is not a mount point, the second one means it's gone. Do ignore both of these.

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin · 2018-10-30T05:29:52Z

Some previous findings on TestAPISwarmLeaderElection are in #37833

thaJeztah · 2019-07-14T19:52:01Z

internal/test/daemon/daemon_unix.go

 	netnsPath := filepath.Join(execRoot, "netns")
 	filepath.Walk(netnsPath, func(path string, info os.FileInfo, err error) error {
-		if err := unix.Unmount(path, unix.MNT_FORCE); err != nil {
+		if err := unix.Unmount(path, unix.MNT_DETACH); err != nil && err != unix.EINVAL && err != unix.ENOENT {


Looks like this change was already merged in 73baee2 (as part of #38127)

thaJeztah · 2019-07-14T19:52:48Z

Closing this one; looks like the changes that weren't merged yet are just for debugging

kolyshkin requested a review from vdemeester as a code owner September 13, 2018 02:31

GordonTheTurtle added the status/0-triage label Sep 13, 2018

kolyshkin mentioned this pull request Sep 13, 2018

vendor: bump etcd to v3.3.9 #37805

Merged

kolyshkin force-pushed the fix-test-swarm-leader-election branch from f1ee5d2 to c396db4 Compare September 13, 2018 04:26

kolyshkin force-pushed the fix-test-swarm-leader-election branch 2 times, most recently from 02b85fc to 5ff456c Compare September 13, 2018 22:06

kolyshkin mentioned this pull request Sep 15, 2018

TestServiceWithDefaultAddressPoolInit #37836

Merged

kolyshkin force-pushed the fix-test-swarm-leader-election branch from 3a7f69b to 78518f4 Compare September 18, 2018 05:37

thaJeztah added status/2-code-review area/testing area/swarm and removed status/0-triage labels Sep 18, 2018

kolyshkin changed the title ~~[DO NOT MERGE] TestAPISwarmLeaderElection: try to fix~~ [DO NOT MERGE] TestAPISwarmLeaderElection: debug Sep 20, 2018

GordonTheTurtle assigned coolljt0725 Sep 28, 2018

kolyshkin force-pushed the fix-test-swarm-leader-election branch from 78518f4 to 903ae64 Compare October 1, 2018 21:58

kolyshkin force-pushed the fix-test-swarm-leader-election branch from 903ae64 to 4784c4f Compare October 2, 2018 17:31

kolyshkin mentioned this pull request Oct 2, 2018

Flaky test TestSwarmContainer{AttachByNetworkId|AutoStart|EndpointOptions} #29663

Open

kolyshkin force-pushed the fix-test-swarm-leader-election branch 4 times, most recently from 73509ea to f65888d Compare October 3, 2018 06:31

kolyshkin mentioned this pull request Oct 3, 2018

TestSwarmContainerEndpointOptions: fix debug #37958

Merged

kolyshkin force-pushed the fix-test-swarm-leader-election branch from f65888d to aa2e6ca Compare October 3, 2018 16:28

kolyshkin requested a review from tianon as a code owner October 5, 2018 16:16

TestAPISwarmLeaderElection: add some debug

f139b09

...... Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin force-pushed the fix-test-swarm-leader-election branch 2 times, most recently from b19716f to b742cc2 Compare October 5, 2018 16:47

debug: only run TestAPISwarm* tests

38ecb55

[v2: skip unit tests] [v3: skip building integration tests] [v4: make it really verbose] Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin force-pushed the fix-test-swarm-leader-election branch from b742cc2 to 38ecb55 Compare October 5, 2018 17:50

kolyshkin mentioned this pull request Oct 30, 2018

[DO NOT MERGE] Swarm tests debug #38080

Closed

This was referenced Nov 7, 2018

Bump SwarmKit to 8d8689d5a94ac42406883a4cef89b3a5eaec3d11 #38123

Merged

[18.09 backport] API: properly handle invalid JSON to return a 400 status docker-archive/engine#110

Merged

thaJeztah mentioned this pull request Dec 10, 2018

update containerd to v1.2.1 #38327

Merged

derek bot added the invalid label Dec 22, 2018

thaJeztah mentioned this pull request Feb 12, 2019

[18.09 backport] pkg/archive: fix TestTarUntarWithXattr failure on recent kernel docker-archive/engine#150

Merged

thaJeztah removed the invalid label Jul 13, 2019

thaJeztah reviewed Jul 14, 2019

thaJeztah closed this Jul 14, 2019
