Skip to content

[DO NOT MERGE] TestAPISwarmLeaderElection: debug#37833

Closed
kolyshkin wants to merge 3 commits intomoby:masterfrom
kolyshkin:fix-test-swarm-leader-election
Closed

[DO NOT MERGE] TestAPISwarmLeaderElection: debug#37833
kolyshkin wants to merge 3 commits intomoby:masterfrom
kolyshkin:fix-test-swarm-leader-election

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Sep 13, 2018

(summary of my findings based on debug obtained)

Here's the test workflow leading to a failure:

  1. Start three swarm nodes (d1, d2, d3), sequentially.
  2. Check that d1 is the leader.
  3. Stop d1.
  4. Wait until either d2 or d3 become a leader, by calling /nodes/<ID> every 100ms for both of them repeatedly, until ManagerStatus.Leader is true for any of these two.

One of the calls to /nodes/<ID> times out after 20s, probably meaning it is stuck somewhere in (*Cluster).GetNode() (source file daemon/cluster/nodes.go).

Also note that cherry-picking moby/swarmkit#2744 didn't help.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Sep 13, 2018

OK now we know

  1. it fails on the first stage of the test, here:

node, _, err := cli.NodeInspectWithRaw(context.Background(), id)
assert.NilError(t, err)

called from

if d.GetNode(c, d.NodeID()).ManagerStatus.Leader {

called from

// wait for an election to occur
waitAndAssert(c, defaultReconciliationTimeout, checkLeader(d2, d3), checker.True)

  1. increasing timeout won't help

an excerpt from the logs:

01:13:30.862 [d1b57b00693f0] exiting daemon
01:13:30.862 Waiting for election to occur...
01:13:30.862 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
01:13:30.862 waited for 20.001218586s (out of 30s)

@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch from f1ee5d2 to c396db4 Compare September 13, 2018 04:26
@codecov
Copy link

codecov bot commented Sep 13, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@c77cfbf). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #37833   +/-   ##
=========================================
  Coverage          ?   36.09%           
=========================================
  Files             ?      610           
  Lines             ?    45143           
  Branches          ?        0           
=========================================
  Hits              ?    16294           
  Misses            ?    26609           
  Partials          ?     2240

@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch 2 times, most recently from 02b85fc to 5ff456c Compare September 13, 2018 22:06
@kolyshkin
Copy link
Contributor Author

rebase to re-CI 🐕

@kolyshkin
Copy link
Contributor Author

Better debug (from https://jenkins.dockerproject.org/job/Docker-PRs-experimental/42089/console):

01:07:14.476 ----------------------------------------------------------------------
01:07:14.476 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:07:14.476
01:07:14.476 [d3a785812b3a5] waiting for daemon to start
01:07:14.476 [d3a785812b3a5] daemon started
01:07:14.476
01:07:14.476 [dc968556f08ad] waiting for daemon to start
01:07:14.476 [dc968556f08ad] daemon started
01:07:14.476
01:07:14.476 [d6e7fc83aa5d0] waiting for daemon to start
01:07:14.477 [d6e7fc83aa5d0] daemon started
01:07:14.477
01:07:14.477 [d3a785812b3a5] exiting daemon
01:07:14.477 Waiting for election to occur...
01:07:14.477 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [dc968556f08ad] (*Daemon).GetNode: NodeInspectWithRaw("o3kiirubf8077k4f5fyrka8f1") failed
01:07:14.477 waited for 20.00151239s (out of 30s)
01:07:14.477 [dc968556f08ad] exiting daemon
01:07:14.477 [d6e7fc83aa5d0] exiting daemon

@kolyshkin
Copy link
Contributor Author

The second node can't connect to the first one, although the first one seems fine.

First node (which is the manager, d3a785812b3a5):

time="2018-09-13T23:13:30.126000753Z" level=debug msg="form data: {"AdvertiseAddr":"","AutoLockManagers":false,"Availability":"","DataPathAddr":"","DefaultAddrPool":null,"ForceNewCluster":false,"ListenAddr":"0.0.0.0:2477","Spec":{"CAConfig":{},"Dispatcher":{},"EncryptionConfig":{"AutoLockManagers":false},"Labels":null,"Orchestration":{},"Raft":{"ElectionTick":0,"HeartbeatTick":0},"TaskDefaults":{}},"SubnetSize":0}"
time="2018-09-13T23:13:30.156305024Z" level=info msg="Listening for connections" addr="[::]:2477" module=node node.id=latz8cgh5l4og5ybp9fml0gkz proto=tcp

Second node, which eventually fails:

time="2018-09-13T23:13:33.270277003Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {127.0.0.1:2477 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused". Reconnecting..." module=grpc
...

Still I don't possess enough swarm knowledge to figure out what is going on :(

@kolyshkin
Copy link
Contributor Author

@dperny do you have any idea about why this might fail this way?

@kolyshkin
Copy link
Contributor Author

The second node can't connect to the first one, although the first one seems fine.

This was all wrong of course, since we have stopped the first daemon, this is why others can't connect to it, and this is not the source of the failure.

Can it happen because swarmRequestTimeout (20s) is less than maxReconnectDelay (30s)?

@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch from 3a7f69b to 78518f4 Compare September 18, 2018 05:37
@kolyshkin
Copy link
Contributor Author

The fix from moby/swarmkit#2744 didn't help; same issue:

01:22:15.929
01:22:15.929 ----------------------------------------------------------------------
01:22:15.930 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:22:15.930
01:22:15.930 [d5edde61dee1f] waiting for daemon to start
01:22:15.930 [d5edde61dee1f] daemon started
01:22:15.930
01:22:15.930 [dce5ddeb80d23] waiting for daemon to start
01:22:15.930 [dce5ddeb80d23] daemon started
01:22:15.930
01:22:15.930 [d2967d8955d2d] waiting for daemon to start
01:22:15.930 [d2967d8955d2d] daemon started
01:22:15.930
01:22:15.930 [d5edde61dee1f] exiting daemon
01:22:15.930 Waiting for election to occur...
01:22:15.931 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [dce5ddeb80d23] (*Daemon).GetNode: NodeInspectWithRaw("ywro1y488c8mry36jd6gdn4ve") failed
01:22:15.931 waited for 20.00124227s (out of 30s)
01:22:15.931 [dce5ddeb80d23] exiting daemon
01:22:15.931 [d2967d8955d2d] exiting daemon
01:22:40.245
01:22:40.245 ----------------------------------------------------------------------

@kolyshkin kolyshkin changed the title [DO NOT MERGE] TestAPISwarmLeaderElection: try to fix [DO NOT MERGE] TestAPISwarmLeaderElection: debug Sep 20, 2018
@dperny
Copy link
Contributor

dperny commented Sep 21, 2018

Are you running this test and getting a failure locally? If so, how are you doing it? Also, is this flaky, or consistently failing?

@kolyshkin
Copy link
Contributor Author

Are you running this test and getting a failure locally?

Nope, I was not able to repro this locally, only on Moby CI. So what I was doing is adding debug and waiting for PR's CI to fail, when analyse test logs and daemon logs (from CI bundle tarball).

Also, is this flaky, or consistently failing?

It is flaky, but it feels like it is currently #1 in the "most frequently failing test". From the top of my head it fails on a PR with 50-80% probability.

The original issue about this test failing is here: #32673, and at the bottom of that page you can find some other PRs in which this test was failing.

@kolyshkin
Copy link
Contributor Author

From the top of my head it fails on a PR with 50-80% probability.

What I meant here, is there is 50-60% chance that a single CI run on any given PR will result in this test failing. The CI run consists of 4 linux systems (two x64, s390 and power) and 1 windows system.

@dperny
Copy link
Contributor

dperny commented Oct 1, 2018

i am coming back around to investigate this now. i'll let you know as soon as i have something.

@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch from 78518f4 to 903ae64 Compare October 1, 2018 21:58
@kolyshkin
Copy link
Contributor Author

rebased to current master

@kolyshkin
Copy link
Contributor Author

aaaand the test is passing now (from what I can see from the last 5-7 PRs, in all other PRs, too). Could it be because of some recent merges? Or CI infrastructure changes? Or moon phase?

In any case, it appears to not be failing recently. I'll keep looking at it.

@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch from 903ae64 to 4784c4f Compare October 2, 2018 17:31
@dperny
Copy link
Contributor

dperny commented Oct 2, 2018

classic heisenbug

@kolyshkin
Copy link
Contributor Author

If I could repro it locally I'd do git bisect to figure out what exactly suddenly "fixed" it, alas it is only reproducible on Travis. I am going to keep triggering a few more CI runs, say up to 10, trying to repro.

@kolyshkin
Copy link
Contributor Author

@dperny got a failure on ppc:

01:27:00.837
01:27:00.837 ----------------------------------------------------------------------
01:27:00.837 FAIL: docker_api_swarm_test.go:296: DockerSwarmSuite.TestAPISwarmLeaderElection
01:27:00.837
01:27:00.837 [d8286f7d756c4] waiting for daemon to start
01:27:00.837 [d8286f7d756c4] daemon started
01:27:00.837
01:27:00.837 [d3f46a740a57e] waiting for daemon to start
01:27:00.837 [d3f46a740a57e] daemon started
01:27:00.837
01:27:00.837 [dc744066461da] waiting for daemon to start
01:27:00.838 [dc744066461da] daemon started
01:27:00.838
01:27:00.838 [d8286f7d756c4] exiting daemon
01:27:00.838 Waiting for election to occur...
01:27:00.838 assertion failed: error is not nil: Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded: [d3f46a740a57e] (*Daemon).GetNode: NodeInspectWithRaw("c79apwb38k2lf8j0j205phjxh") failed
01:27:00.838 waited for 20.002914211s (out of 30s)
01:27:00.838 [d3f46a740a57e] exiting daemon
01:27:00.838 [dc744066461da] exiting daemon
01:27:17.867
01:27:17.867 ----------------------------------------------------------------------

Test logs here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11630/console
Bundle tarball (with daemon logs etc) here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11630/

@kolyshkin kolyshkin requested a review from tianon as a code owner October 5, 2018 16:16
@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch 2 times, most recently from b19716f to b742cc2 Compare October 5, 2018 16:47
[v2: skip unit tests]
[v3: skip building integration tests]
[v4: make it really verbose]

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the fix-test-swarm-leader-election branch from b742cc2 to 38ecb55 Compare October 5, 2018 17:50
@kolyshkin
Copy link
Contributor Author

As requested by @dperny, make test results verbose even if they not fail, and limited tests to run to TestAPISwarm* in integration-cli.

1. Using MNT_FORCE flag does not make sense for nsfs. Using MNT_DETACH
though might help.

2. When -check.vv is added to TESTFLAGS, there are a lot of messages
like this one:

> unmount of /tmp/dxr/d847fd103a4ba/netns failed: invalid argument

and some like

> unmount of /tmp/dxr/dd245af642d94/netns failed: no such file or directory

The first one means directory is not a mount point, the second one
means it's gone. Do ignore both of these.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Oct 30, 2018

Some previous findings on TestAPISwarmLeaderElection are in #37833

netnsPath := filepath.Join(execRoot, "netns")
filepath.Walk(netnsPath, func(path string, info os.FileInfo, err error) error {
if err := unix.Unmount(path, unix.MNT_FORCE); err != nil {
if err := unix.Unmount(path, unix.MNT_DETACH); err != nil && err != unix.EINVAL && err != unix.ENOENT {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this change was already merged in 73baee2 (as part of #38127)

@thaJeztah
Copy link
Member

Closing this one; looks like the changes that weren't merged yet are just for debugging

@thaJeztah thaJeztah closed this Jul 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants