Flaky Test: TestSwarmPublishDuplicatePorts on s390 #30427

Open
ddingel opened this issue Jan 24, 2017 · 11 comments

Comments

@ddingel

ddingel commented Jan 24, 2017

The CI sometimes fails on the TestSwarmPublishDuplicatePorts test case, for example during Build 1781.

23:01:28 ----------------------------------------------------------------------
23:01:28 FAIL: docker_cli_swarm_test.go:1576: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
23:01:28 
23:01:28 [ddc1e2218a2e3] waiting for daemon to start
23:01:28 [ddc1e2218a2e3] daemon started
23:01:28 
23:01:28 docker_cli_swarm_test.go:1584:
23:01:28     // make sure task has been deployed.
23:01:28     waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
23:01:28 docker_utils_test.go:1125:
23:01:28     c.Assert(v, checker, args...)
23:01:28 ... obtained int = 0
23:01:28 ... expected int = 1
23:01:28 
23:01:28 [ddc1e2218a2e3] exiting daemon
23:01:29 
23:01:29 ----------------------------------------------------------------------

Interestingly, while this test case can fail on its own, it sometimes fails together with [Flaky test: TestServiceUpdatePort on s390 #26601](https://github.com/docker/docker/issues/26601), for example during [Build 1802](https://jenkins.dockerproject.org/job/Docker%20Master%20(z)/1802/).

17:09:43 ----------------------------------------------------------------------
17:09:43 FAIL: docker_cli_service_update_test.go:14: DockerSwarmSuite.TestServiceUpdatePort
17:09:43 
17:09:43 [d2a63063bb214] waiting for daemon to start
17:09:43 [d2a63063bb214] daemon started
17:09:43 
17:09:43 docker_cli_service_update_test.go:23:
17:09:43     waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
17:09:43 docker_utils_test.go:1125:
17:09:43     c.Assert(v, checker, args...)
17:09:43 ... obtained int = 0
17:09:43 ... expected int = 1
17:09:43 
17:09:43 [d2a63063bb214] exiting daemon
17:09:45 
17:09:45 ----------------------------------------------------------------------

On the other hand, TestServiceUpdatePort has not failed on its own for at least the last 29 builds.

@vdemeester
Member

/cc @aaronlehmann @aboch

@tophj-ibm
Contributor

@michael-holzheu
Contributor

Last night I tried to reproduce the issue with Docker commit 833f1f4. I ran the test 1000 times on an s390x Debian Jessie host with kernel 4.6 without any failure.
TestSwarmPublishDuplicatePorts-s390x.txt

@tophj-ibm
Contributor

Thanks @michael-holzheu. This seems to point to another test influencing it somehow; maybe something in the swarm isn't getting cleaned up properly.

@michael-holzheu
Contributor

This may be obvious, but the logs from failed runs such as 2306, 2311, 2313 and 2320 show that it always took 31 seconds from the end of the previous test until the FAIL message appeared. For example, here is the output of run 2320 (09:11:47 to 09:12:18 = 31 seconds):

09:11:47 PASS: docker_cli_swarm_test.go:226: DockerSwarmSuite.TestSwarmPublishAdd	1.048s
09:12:18 
09:12:18 ----------------------------------------------------------------------
09:12:18 FAIL: docker_cli_swarm_test.go:1680: DockerSwarmSuite.TestSwarmPublishDuplicatePorts

So the test fails because it hits the reconciliation timeout:

~/docker-fork/integration-cli$ git grep "defaultReconciliationTimeout ="
docker_api_swarm_test.go:var defaultReconciliationTimeout = 30 * time.Second

@michael-holzheu
Contributor

michael-holzheu commented Apr 7, 2017

> This seems to point to another test influencing it somehow, maybe something in the swarm isn't getting cleaned up properly

@tophj-ibm: I have now tried running the full DockerSwarmSuite in a loop:

#!/bin/bash
for i in {1..1000}
do
        echo "---------------------------------------------"
        echo "TEST: $i"
        echo "---------------------------------------------"
        TESTFLAGS='-check.f DockerSwarmSuite.*'  \
        DOCKER_GRAPHDRIVER=vfs DOCKER_EXECDRIVER=native TIMEOUT="120m" \
        hack/make.sh test-integration-cli
done

The first 12 loops were successful, but starting with run 13 test cases began to fail:

---------------------------------------------
TEST: 13
---------------------------------------------
FAIL: docker_cli_prune_unix_test.go:28: DockerSwarmSuite.TestPruneNetwork
FAIL: docker_cli_service_update_test.go:14: DockerSwarmSuite.TestServiceUpdatePort
FAIL: docker_cli_swarm_test.go:317: DockerSwarmSuite.TestSwarmContainerAttachByNetworkId
FAIL: docker_cli_swarm_test.go:270: DockerSwarmSuite.TestSwarmContainerAutoStart
FAIL: docker_cli_swarm_test.go:292: DockerSwarmSuite.TestSwarmContainerEndpointOptions
FAIL: docker_cli_swarm_test.go:1588: DockerSwarmSuite.TestSwarmNetworkCreateDup
FAIL: docker_cli_swarm_test.go:1464: DockerSwarmSuite.TestSwarmNetworkIPAMOptions
FAIL: docker_cli_swarm_test.go:1680: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
OOPS: 104 passed, 9 skipped, 8 FAILED
--- FAIL: Test (2399.67s)

In runs 14-23, the same 15 tests failed every time:

FAIL: docker_api_swarm_test.go:822: DockerSwarmSuite.TestAPIDuplicateNetworks
FAIL: docker_api_swarm_service_test.go:27: DockerSwarmSuite.TestAPIServiceUpdatePort
FAIL: docker_api_swarm_test.go:873: DockerSwarmSuite.TestAPISwarmHealthcheckNone
FAIL: docker_cli_swarm_test.go:1759: DockerSwarmSuite.TestNetworkInspectWithDuplicateNames
FAIL: docker_cli_swarm_test.go:345: DockerSwarmSuite.TestOverlayAttachable
FAIL: docker_cli_swarm_test.go:367: DockerSwarmSuite.TestOverlayAttachableOnSwarmLeave
FAIL: docker_cli_swarm_test.go:394: DockerSwarmSuite.TestOverlayAttachableReleaseResourcesOnFailure
FAIL: docker_cli_prune_unix_test.go:28: DockerSwarmSuite.TestPruneNetwork
FAIL: docker_cli_service_update_test.go:14: DockerSwarmSuite.TestServiceUpdatePort
FAIL: docker_cli_swarm_test.go:317: DockerSwarmSuite.TestSwarmContainerAttachByNetworkId
FAIL: docker_cli_swarm_test.go:270: DockerSwarmSuite.TestSwarmContainerAutoStart
FAIL: docker_cli_swarm_test.go:292: DockerSwarmSuite.TestSwarmContainerEndpointOptions
FAIL: docker_cli_swarm_test.go:1588: DockerSwarmSuite.TestSwarmNetworkCreateDup
FAIL: docker_cli_swarm_test.go:1464: DockerSwarmSuite.TestSwarmNetworkIPAMOptions
FAIL: docker_cli_swarm_test.go:1680: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
OOPS: 97 passed, 9 skipped, 15 FAILED

TestSwarmPublishDuplicatePorts-s390x-full-DockerSwarmSuite.txt
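To see how the two failure lists relate, one can compute which tests fail in runs 14-23 but not in run 13. The sketch below does this with the test names copied from the lists above; the function name `newFailures` is only illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// newFailures returns the names present in later but not in earlier, sorted.
func newFailures(earlier, later []string) []string {
	seen := make(map[string]bool, len(earlier))
	for _, name := range earlier {
		seen[name] = true
	}
	var added []string
	for _, name := range later {
		if !seen[name] {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

func main() {
	// The 8 failures reported for run 13.
	run13 := []string{
		"TestPruneNetwork",
		"TestServiceUpdatePort",
		"TestSwarmContainerAttachByNetworkId",
		"TestSwarmContainerAutoStart",
		"TestSwarmContainerEndpointOptions",
		"TestSwarmNetworkCreateDup",
		"TestSwarmNetworkIPAMOptions",
		"TestSwarmPublishDuplicatePorts",
	}
	// The 15 failures reported for runs 14-23: 7 further tests plus
	// the same 8 entries that already failed in run 13.
	runs14to23 := append([]string{
		"TestAPIDuplicateNetworks",
		"TestAPIServiceUpdatePort",
		"TestAPISwarmHealthcheckNone",
		"TestNetworkInspectWithDuplicateNames",
		"TestOverlayAttachable",
		"TestOverlayAttachableOnSwarmLeave",
		"TestOverlayAttachableReleaseResourcesOnFailure",
	}, run13...)

	for _, name := range newFailures(run13, runs14to23) {
		fmt.Println(name)
	}
}
```

All 8 failures from run 13 reappear in the later runs, so the difference is exactly the 7 additional tests. This matches a state that gets worse over time and does not recover between loops, though the lists alone do not prove that.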

I am not sure whether this observation helps with the TestSwarmPublishDuplicatePorts problem ...

@michael-holzheu
Contributor

Update

The last failing PR build was run 2020 (Apr 28, 2017): https://jenkins.dockerproject.org/job/Docker-PRs-s390x/2020/

jenkins-s390x-pr$ grep -w "FAIL:" * | grep DuplicatePorts
2020:FAIL: docker_cli_swarm_test.go:1724: DockerSwarmSuite.TestSwarmPublishDuplicatePorts

Since then we have had 848 runs in which the DuplicatePorts test passed.

@thaJeztah
Member

Thanks @michael-holzheu - I think we can close this one then, but ping me if I closed prematurely :D

@tophj-ibm
Contributor

Seen failing again on arm in #33892 :(

@thaJeztah
Member

Interesting; let me reopen

@thaJeztah thaJeztah reopened this Jul 27, 2017
thaJeztah added a commit to thaJeztah/docker that referenced this issue May 20, 2018
Changes included:

- libnetwork#2147 Adding logs for ipam state
- libnetwork#2143 Fix race conditions in the overlay network driver
  - possibly addresses moby#36743 services do not start: ingress-sbox is already present
  - possibly addresses moby#30427 Flaky Test: TestSwarmPublishDuplicatePorts on s390
  - possibly addresses moby#36501 Flaky tests: Service "port" tests
- libnetwork#2142 Add wait time into xtables lock warning
- libnetwork#2135 filter xtables lock warnings when firewalld is active
- libnetwork#2140 Switch from x/net/context to context
- libnetwork#2134 Adding a recovery mechanism for a split gossip cluster

Signed-off-by: Sebastiaan van Stijn <[email protected]>
6 participants