[DO NOT MERGE] TestAPISwarmLeaderElection: debug #37833
kolyshkin wants to merge 3 commits into moby:master
Conversation
OK now we know
moby/internal/test/daemon/node.go, lines 25 to 26 (at 03e089e), called from moby/integration-cli/docker_api_swarm_test.go, line 319 (at 03e089e), called from moby/integration-cli/docker_api_swarm_test.go, lines 334 to 335 (at 03e089e)
an excerpt from the logs:
Force-pushed f1ee5d2 to c396db4
Codecov Report
@@            Coverage Diff            @@
##             master   #37833   +/-   ##
=========================================
  Coverage          ?   36.09%
=========================================
  Files             ?      610
  Lines             ?    45143
  Branches          ?        0
=========================================
  Hits              ?    16294
  Misses            ?    26609
  Partials          ?     2240
Force-pushed 02b85fc to 5ff456c
rebase to re-CI 🐕
Better debug (from https://jenkins.dockerproject.org/job/Docker-PRs-experimental/42089/console):
01:07:14.476 ----------------------------------------------------------------------
The second node can't connect to the first one, although the first one seems fine. First node (which is the manager, d3a785812b3a5):
Second node, which eventually fails:
Still, I don't have enough swarm knowledge to figure out what is going on :(
@dperny do you have any idea about why this might fail this way?

This was all wrong, of course: we had stopped the first daemon, which is why the others can't connect to it, so this is not the source of the failure. Can it happen because swarmRequestTimeout (20s) is less than maxReconnectDelay (30s)?
Force-pushed 3a7f69b to 78518f4
The fix from moby/swarmkit#2744 didn't help; same issue: 01:22:15.929
Are you running this test and getting a failure locally? If so, how are you doing it? Also, is this flaky, or consistently failing?

Nope, I was not able to repro this locally, only on Moby CI. So what I was doing was adding debug and waiting for the PR's CI to fail, then analysing the test logs and daemon logs (from the CI bundle tarball).
It is flaky, but it feels like it is currently #1 among the most frequently failing tests. Off the top of my head, it fails on a PR with 50-80% probability. The original issue about this test failing is #32673, and at the bottom of that page you can find some other PRs in which this test was failing.
What I meant here is that there is a 50-60% chance that a single CI run on any given PR will result in this test failing. A CI run consists of 4 Linux systems (two x64, s390 and power) and 1 Windows system.
i am coming back around to investigate this now. i'll let you know as soon as i have something.
Force-pushed 78518f4 to 903ae64
rebased to current master

aaaand the test is passing now (from what I can see in the last 5-7 PRs, and in all other PRs, too). Could it be because of some recent merges? Or CI infrastructure changes? Or the moon phase? In any case, it appears not to be failing recently. I'll keep looking at it.
Force-pushed 903ae64 to 4784c4f
classic heisenbug
Force-pushed 73509ea to f65888d
Force-pushed f65888d to aa2e6ca
If I could repro it locally, I'd do a git bisect to figure out what exactly suddenly "fixed" it; alas, it is only reproducible on Travis. I am going to keep triggering a few more CI runs, say up to 10, trying to repro.
@dperny just saw this test fail on another PR's CI (#37958), running on power
@dperny got a failure on ppc: 01:27:00.837
Test logs here: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/11630/console
...... Signed-off-by: Kir Kolyshkin <[email protected]>
Force-pushed b19716f to b742cc2
[v2: skip unit tests] [v3: skip building integration tests] [v4: make it really verbose] Signed-off-by: Kir Kolyshkin <[email protected]>
Force-pushed b742cc2 to 38ecb55
As requested by @dperny, make test results verbose even if they do not fail, and limit the tests run to TestAPISwarm* in integration-cli.
1. Using the MNT_FORCE flag does not make sense for nsfs. Using MNT_DETACH, though, might help.
2. When -check.vv is added to TESTFLAGS, there are a lot of messages like this one:
> unmount of /tmp/dxr/d847fd103a4ba/netns failed: invalid argument
and some like
> unmount of /tmp/dxr/dd245af642d94/netns failed: no such file or directory
The first one means the directory is not a mount point, the second one means it's gone. Ignore both of these.
Signed-off-by: Kir Kolyshkin <[email protected]>
Some previous findings on

	netnsPath := filepath.Join(execRoot, "netns")
	filepath.Walk(netnsPath, func(path string, info os.FileInfo, err error) error {
-	if err := unix.Unmount(path, unix.MNT_FORCE); err != nil {
+	if err := unix.Unmount(path, unix.MNT_DETACH); err != nil && err != unix.EINVAL && err != unix.ENOENT {
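For context, here is a self-contained sketch of the cleanup logic shown in the diff above. It substitutes the stdlib syscall package for golang.org/x/sys/unix so it runs without extra dependencies (Linux only); the cleanupNetns helper, its return value, and the example execRoot path are illustrative additions, not the actual moby code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// cleanupNetns lazily detaches every path under the netns directory,
// ignoring EINVAL ("not a mount point") and ENOENT ("already gone"),
// which are exactly the two errors the commit message above says to
// ignore. It returns how many unexpected unmount errors were logged.
func cleanupNetns(execRoot string) int {
	failures := 0
	netnsPath := filepath.Join(execRoot, "netns")
	filepath.Walk(netnsPath, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // the netns directory may not exist at all
		}
		if err := syscall.Unmount(path, syscall.MNT_DETACH); err != nil &&
			err != syscall.EINVAL && err != syscall.ENOENT {
			fmt.Fprintf(os.Stderr, "unmount of %s failed: %v\n", path, err)
			failures++
		}
		return nil
	})
	return failures
}

func main() {
	// A missing directory is silently skipped, matching the
	// "no such file or directory" case.
	fmt.Println(cleanupNetns("/tmp/nonexistent-execroot-example"))
}
```

The design point of the patch is that MNT_DETACH detaches the mount immediately and cleans up lazily once it is no longer busy, whereas MNT_FORCE is meaningful mostly for network filesystems and does nothing useful for nsfs.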
Closing this one; it looks like the changes that weren't merged yet are just for debugging.
(summary of my findings based on the debug obtained)
Here's the test workflow leading to a failure:
- /nodes/<ID> is requested every 100ms for both nodes repeatedly, until ManagerStatus.Leader is true for either of the two.
- One of the calls to /nodes/<ID> times out after 20s, probably meaning it is stuck somewhere in (*Cluster).GetNode() (source file daemon/cluster/nodes.go).
Also note that cherry-picking moby/swarmkit#2744 didn't help.