
Conversation

@thaJeztah (Member) commented Sep 9, 2019

Relates to #38885 (Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey)

This test was updated in b79adac (#39616), but is still flaky; see #39883 (comment):

```
20:24:13  FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] waiting for daemon to start
20:24:13  [d6f95e679cb65] daemon started
20:24:13
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d204a02ba4780] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
20:24:13  Creating a new daemon at: /go/src/github.com/docker/docker/bundles/test-integration/3/DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  [d873d6a842829] joining swarm manager [d6f95e679cb65]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] waiting for daemon to start
20:24:13  [d204a02ba4780] daemon started
20:24:13
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] waiting for daemon to start
20:24:13  [d873d6a842829] daemon started
20:24:13
20:24:13  docker_cli_swarm_test.go:1413:
20:24:13      c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
20:24:13  ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc000934240), Stderr:[]uint8(nil)} ("exit status 1")
20:24:13  ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
20:24:13
20:24:13
20:24:13  [d6f95e679cb65] Stopping daemon
20:24:13  [d6f95e679cb65] exiting daemon
20:24:13  [d6f95e679cb65] Daemon stopped
20:24:13  [d204a02ba4780] Stopping daemon
20:24:13  [d204a02ba4780] exiting daemon
20:24:13  [d204a02ba4780] Daemon stopped
20:24:13  [d873d6a842829] Stopping daemon
20:24:13  [d873d6a842829] exiting daemon
20:24:13  [d873d6a842829] Daemon stopped
```

The interesting bit is that the retry loop should sleep for 3 seconds before retrying, yet in the failure above the test started and failed within a second. That means a different error / output was returned, presumably one that the retry condition does not match, so the loop never slept.
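
For context, the retry around `node ls` in that test has roughly the shape sketched below. This is an approximation (the `d`, `c`, `checker`, and `check` identifiers come from the integration-cli harness, and the exact retry condition may differ): the 3-second sleep only happens when the retry condition recognises the output, so an error that slips past that condition hits the assertion immediately, which would match the timing seen above.

```go
// Approximate sketch of the retry loop around "docker node ls" in
// TestSwarmClusterRotateUnlockKey; names and the exact matched string
// are assumptions, not a verbatim copy of the test.
retry := 0
for {
	outs, err := d.Cmd("node", "ls")
	// Only a recognised "no leader" message is retried after a
	// 3-second sleep; anything else reaches the assertion right away.
	if err != nil && retry < 5 && strings.Contains(outs, "swarm does not have a leader") {
		retry++
		time.Sleep(3 * time.Second)
		continue
	}
	c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
	break
}
```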

This patch adds some additional debugging to that test, to see if we can catch why it is still flaky.
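
Purely as an illustration of the kind of debugging meant here (a sketch, not necessarily what this patch does): logging the raw error and output of every `node ls` attempt makes the unexpected value visible in the CI log.

```go
// Hypothetical debug logging inside the retry loop (sketch only):
// record what "node ls" actually returned on each attempt.
outs, err := d.Cmd("node", "ls")
if err != nil {
	c.Logf("attempt %d: 'node ls' failed: %v\noutput:\n%s", retry, err, outs)
}
```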

@thaJeztah (Member, Author) commented

Logs from that failing test:

d6f95e679cb65.log
d204a02ba4780.log
d873d6a842829.log

@thaJeztah (Member, Author) commented Sep 9, 2019

Failure on RS1 is #39857:

https://ci.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-39885/runs/1/nodes/200/log/?start=0

```
[2019-09-09T20:37:36.900Z] ok  	github.com/docker/docker/daemon/logger	0.525s	coverage: 43.0% of statements
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:35Z" level=info msg="Trying to get region from EC2 Metadata"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Log stream already exists" errorCode=ResourceAlreadyExistsException logGroupName= logStreamName= message= origError="<nil>"
[2019-09-09T20:37:36.900Z] --- FAIL: TestLogBlocking (0.02s)
[2019-09-09T20:37:36.900Z]     cloudwatchlogs_test.go:313: Expected to be able to read from stream.messages but was unable to
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg=Error
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=InvalidSequenceTokenException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=error msg="Failed to put log events" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token" origError="<nil>"
[2019-09-09T20:37:36.900Z] time="2019-09-09T20:37:36Z" level=info msg="Data already accepted, ignoring error" errorCode=DataAlreadyAcceptedException logGroupName=groupName logStreamName=streamName message="use token token"
[2019-09-09T20:37:36.900Z] FAIL
[2019-09-09T20:37:36.900Z] coverage: 78.2% of statements
[2019-09-09T20:37:36.900Z] FAIL	github.com/docker/docker/daemon/logger/awslogs	0.630s
```
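
The "Expected to be able to read from stream.messages but was unable to" failure is the classic shape of a test that reads from a channel under a short timeout. The sketch below is illustrative only (assumed names and timeout, not the actual awslogs test): if the CI worker is busy, the goroutine that writes to the channel may not be scheduled before `time.After` fires, which is a common source of this kind of flake.

```go
package sketch

import (
	"testing"
	"time"
)

// Illustrative only: a goroutine delivers a message on an unbuffered
// channel, and the test requires that it arrives within a short window.
// On a loaded machine, the timeout case can win the select instead.
func TestReadWithTimeout(t *testing.T) {
	messages := make(chan string)
	go func() {
		messages <- "hello"
	}()
	select {
	case <-messages:
		// message delivered in time
	case <-time.After(30 * time.Millisecond):
		t.Fatal("Expected to be able to read from messages but was unable to")
	}
}
```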

@thaJeztah (Member, Author) commented

rebased; @cpuguy83 @tiborvass ptal

@thaJeztah force-pushed the debug_flaky_TestSwarmClusterRotateUnlockKey branch from d3ab0ee to 78d137d on September 19, 2019, 13:01
@thaJeztah (Member, Author) commented

rebased
