
services do not start: ingress-sbox is already present #36743

Open
tonistiigi opened this issue Mar 30, 2018 · 12 comments
Labels
area/networking, area/swarm, area/testing, kind/bug

Comments

@tonistiigi (Member)

When creating services with port options, in some cases the services do not start. The logs show that the task fails on startup with the error "container ingress-sbox is already present in sandbox ingress_sbox". The manager then keeps retrying but always hits the same error.

level=error msg="fatal task error" error="container ingress-sbox is already present in sandbox ingress_sbox" module=node/agent/taskmanager node.id=ok1mq6f0r0yhmb6l7c5vplet7 service.id=ikjt6wfp2br0p5n47hk5dwl6m task.id=vb7zw9c3a7qwuq0qehn8pmg8b

One place it shows up consistently is Moby CI. Both TestAPIServiceUpdatePort and TestSwarmPublishDuplicatePorts have failed ~70 times in the last 2 months (full stats and links to failures: https://gist.githubusercontent.com/tonistiigi/9e91cbb968dea83113d09ce5f9dbff79/raw/17696e6fc2cce473d101f70ce6bdeb75a4f35bfd/moby_master_stats)

Examples:

https://jenkins.dockerproject.org/job/Docker%20Master/label=ubuntu-1604-aufs-stable/9755/console

17:31:28 
17:31:28 ----------------------------------------------------------------------
17:31:28 FAIL: docker_api_swarm_service_test.go:32: DockerSwarmSuite.TestAPIServiceUpdatePort
17:31:28 
17:31:28 [da70551459fe6] waiting for daemon to start
17:31:28 [da70551459fe6] daemon started
17:31:28 
17:31:28 docker_api_swarm_service_test.go:38:
17:31:28     waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
17:31:28 docker_utils_test.go:452:
17:31:28     c.Assert(v, checker, args...)
17:31:28 ... obtained int = 0
17:31:28 ... expected int = 1
17:31:28 
17:31:28 [da70551459fe6] exiting daemon
17:31:40 
17:31:40 ----------------------------------------------------------------------

Daemon logs: https://gist.github.com/tonistiigi/6cb35ce9947cdbbcf410cf9863945220#file-docker-log-L317


https://jenkins.dockerproject.org/job/Docker%20Master/label=ubuntu-1604-aufs-stable/9755/console

17:55:13 
17:55:13 ----------------------------------------------------------------------
17:55:13 FAIL: docker_cli_swarm_test.go:1610: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
17:55:13 
17:55:13 [d682c9bd5a30d] waiting for daemon to start
17:55:13 [d682c9bd5a30d] daemon started
17:55:13 
17:55:13 docker_cli_swarm_test.go:1618:
17:55:13     // make sure task has been deployed.
17:55:13     waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
17:55:13 docker_utils_test.go:452:
17:55:13     c.Assert(v, checker, args...)
17:55:13 ... obtained int = 0
17:55:13 ... expected int = 1
17:55:13 
17:55:13 [d682c9bd5a30d] exiting daemon
17:55:16 
17:55:16 ----------------------------------------------------------------------

Daemon logs https://gist.github.com/tonistiigi/92a587084a19e388ac96ec5b096a4dd2#file-gistfile1-txt-L927

Reported previously in #30427 #36501

@fcrisciani

tonistiigi added kind/bug, area/networking, area/testing, and area/swarm labels Mar 30, 2018
@vieux (Contributor) commented Mar 30, 2018

@selansen can you please take a look?

@selansen (Contributor)

Sure, will do.

@selansen (Contributor) commented May 1, 2018

I tried to reproduce this issue locally. I was able to get a failure, but I don't see the error message you mentioned. I see an error like the one below:
level=error msg="fatal task error" error="error creating overlay mount to /go/src/github.com/docker/docker/bundles/test-integration/d40ba35052eb0/root/overlay2/c853182a083420ab6d6c5192f96558b1baefcb80702bf2ae01d801b98b87cf17-init/merged: invalid argument" module=node/agent/taskmanager node.id=voqx65fq1sb4kn9na3nxlikwz service.id=w6jwwcdubpl7mqn4qgiwrowxk task.id=qwid14va6bf11qhhoz2gfbecd

I will continue to debug this issue.

@selansen (Contributor) commented May 2, 2018

The above failure is because I ran the test on my Mac; other test cases also fail on my Mac. On an AWS VM the test passes. I will see if I can reproduce it on my VM.

@selansen (Contributor) commented May 2, 2018

I tried to reproduce it on my local VM but it passes every time. I tried a single test case and also a set of test cases, hoping the issue might be caused by cleanup left over from another test.
Based on my CI-failure testing from last week, I assume this test case failure may be caused by another test case's cleanup. The error is that ingress_sbox is already present when we create the new sandbox. From the log I looked for any sandbox-delete error or warning message, but I could find none.
So I am assuming it could be a cleanup issue when moving from one test case to the next. But I need to dig more into why it is not happening consistently yet shows up across many CI failures.
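
For reference, here is a rough, self-contained sketch (not the actual libnetwork source; the type and field names are only illustrative) of the kind of guard in the sandbox-creation path that produces this error when a sandbox for the same container is still registered from an earlier, incompletely cleaned-up attempt:

```go
package main

import "fmt"

type sandbox struct {
	id          string
	containerID string
}

type controller struct {
	sandboxes map[string]*sandbox
}

// newSandbox refuses to create a second sandbox for a container whose
// previous sandbox was never removed.
func (c *controller) newSandbox(containerID, sandboxID string) (*sandbox, error) {
	for _, s := range c.sandboxes {
		if s.containerID == containerID {
			return nil, fmt.Errorf("container %s is already present in sandbox %s", containerID, s.id)
		}
	}
	sb := &sandbox{id: sandboxID, containerID: containerID}
	c.sandboxes[sb.id] = sb
	return sb, nil
}

func main() {
	c := &controller{sandboxes: map[string]*sandbox{}}
	// First attempt registers the ingress sandbox but is never cleaned up.
	c.newSandbox("ingress-sbox", "ingress_sbox")
	// The retry then fails with the error seen in the daemon logs.
	if _, err := c.newSandbox("ingress-sbox", "ingress_sbox"); err != nil {
		fmt.Println(err)
	}
}
```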

@ctelfer (Contributor) commented May 2, 2018

Actually, from a look at the logs cited above, the ingress-sbox error may be a symptom of another problem. I note that the ingress network creation fails with the following error:

time="2018-03-22T17:30:58.302993358Z" level=error msg="Failed creating ingress network: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists"

(https://gist.github.com/tonistiigi/6cb35ce9947cdbbcf410cf9863945220#file-docker-log-L283) You may note that this occurs shortly before the 'already present' message. It looks like the node tries to join the network, fails due to the VXLAN issue, starts to tear the network down, and then tries to rejoin. I can't tell yet whether the rejoin is failing due to a race with the teardown or due to an error where the teardown isn't cleaning up properly.
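
For context, a minimal sketch of how that "file exists" message can arise, assuming the github.com/vishvananda/netlink package that the overlay driver builds on; the interface name, VNI, and error wrapping below are illustrative, not the actual driver code:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// createVxlan sketches the VXLAN interface creation step of the subnet
// sandbox join.
func createVxlan(name string, vni int) error {
	vxlan := &netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{Name: name},
		VxlanId:   vni,
	}
	// If an interface with this name (or VNI) is still around from an
	// incomplete teardown, the kernel rejects the add with EEXIST, which
	// Go reports as "file exists".
	if err := netlink.LinkAdd(vxlan); err != nil {
		return fmt.Errorf("error creating vxlan interface: %v", err)
	}
	return nil
}

func main() {
	// Creating the same interface twice reproduces the "file exists" error
	// (requires Linux and CAP_NET_ADMIN).
	for i := 0; i < 2; i++ {
		if err := createVxlan("vx-example", 256); err != nil {
			fmt.Println(err)
		}
	}
}
```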

Regardless, there are two different libnetwork bugs that we happen to be looking at right now that relate to the VXLAN creation failure: moby/libnetwork#1765 and moby/libnetwork#1800. Both have PRs under active review. Unfortunately, we don't really have a root cause for the former; while the PR there definitely fixes issues, it isn't certain that it fixes the issue causing this bug. The main difficulty is reproducibility, as @selansen is experiencing.

@ctelfer (Contributor) commented May 2, 2018

To be clear, even though we will address the VXLAN issues, we also need to try to root-cause why the teardown/setup cycle was causing the "ingress-sbox already present" problem.

@selansen (Contributor) commented May 2, 2018

I noticed that error message too. I feel we should merge 1800 and 1765 regardless of this issue.

@ctelfer (Contributor) commented May 2, 2018

Ok, after some review with @selansen I think we understand the symptoms. The cause of the failure is bug moby/libnetwork#1765.

However, this bug triggers a known secondary bug in libnetwork in the network.createLoadBalancerSandbox() call. It doesn't have a PR of its own because I found it while working on improving load-balancing scalability and included the fix in that PR. Here is the fix (it does have its own commit): moby/libnetwork@4543e14. The issue is caused by libnetwork trying to unwind a failure using defer functions that check whether an err value is non-nil, but failing to set that err variable when subsequent operations fail. As a result:

  1. libnetwork allocates an endpoint in the ingress network
  2. libnetwork tries to make the endpoint join the ingress sandbox
  3. join fails due to the VXLAN bug
  4. improper error handling leaves the new endpoint in the network
  5. a subsequent attempt to create the network finds the ingress-endpoint already present, causing the log "container ingress-sbox is already present..."

As you can see, there is already a fix for the second bug, but it is not yet merged. However, even if it were, all it does is hide the secondary symptom. The real problem is in moby/libnetwork#1765, for which we have not yet found a root cause (although we have found some potential ones: moby/libnetwork#2143). Nevertheless, if desired, I can try to expedite the patch for createLoadBalancerSandbox().
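
To illustrate the failure mode, here is a self-contained sketch (not the libnetwork source; the types and helpers are made up for the example) of the unwind pattern described above: the deferred cleanup keys off an outer err variable, but the later failure is returned through a shadowed err, so the cleanup never runs and the endpoint stays registered in the network.

```go
package main

import (
	"errors"
	"fmt"
)

type endpoint struct{ name string }

type network struct{ endpoints []*endpoint }

func (n *network) createEndpoint(name string) (*endpoint, error) {
	ep := &endpoint{name: name}
	n.endpoints = append(n.endpoints, ep)
	return ep, nil
}

func (n *network) deleteEndpoint(ep *endpoint) {
	for i, e := range n.endpoints {
		if e == ep {
			n.endpoints = append(n.endpoints[:i], n.endpoints[i+1:]...)
			return
		}
	}
}

// joinSandbox stands in for the ingress-sandbox join that fails on the VXLAN error.
func joinSandbox(ep *endpoint) error {
	return errors.New(`subnet sandbox join failed for "10.255.0.0/16": error creating vxlan interface: file exists`)
}

func createLoadBalancerSandbox(n *network) error {
	ep, err := n.createEndpoint("ingress-endpoint")
	if err != nil {
		return err
	}
	defer func() {
		// The unwind only fires when the outer err is non-nil at return time.
		if err != nil {
			n.deleteEndpoint(ep)
		}
	}()

	// Buggy variant: "if err := ..." declares a new err that shadows the outer
	// one, so the deferred cleanup never sees this failure and the endpoint is
	// left behind. The fix assigns the failure to the outer err instead.
	if err := joinSandbox(ep); err != nil {
		return err
	}
	return nil
}

func main() {
	n := &network{}
	_ = createLoadBalancerSandbox(n)
	// The stale endpoint makes the next attempt fail with
	// "container ingress-sbox is already present in sandbox ingress_sbox".
	fmt.Println("endpoints left behind:", len(n.endpoints))
}
```

The fix referenced above amounts to making sure the join failure reaches the outer err so the deferred cleanup actually fires.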

@selansen (Contributor) commented May 7, 2018

I have been running CI locally from Friday until now. Not a single time have I come across the service-create failure.

thaJeztah added a commit to thaJeztah/docker that referenced this issue May 20, 2018
Changes included:

- libnetwork#2147 Adding logs for ipam state
- libnetwork#2143 Fix race conditions in the overlay network driver
  - possibly addresses moby#36743 services do not start: ingress-sbox is already present
  - possibly addresses moby#30427 Flaky Test: TestSwarmPublishDuplicatePorts on s390
  - possibly addresses moby#36501 Flaky tests: Service "port" tests
- libnetwork#2142 Add wait time into xtables lock warning
- libnetwork#2135 filter xtables lock warnings when firewalld is active
- libnetwork#2140 Switch from x/net/context to context
- libnetwork#2134 Adding a recovery mechanism for a split gossip cluster

Signed-off-by: Sebastiaan van Stijn <[email protected]>
@Vanuan commented Jun 27, 2018

I've just upgraded from docker-ce 17.04 to 18.05 on a Linux box and hit this problem. Restarting the docker daemon, reinitializing the swarm, and removing networks did not help. Only after rebooting the physical machine was the problem gone.

@ctelfer (Contributor) commented Jul 3, 2018

Sorry to hear that. :( I have a PR outstanding that fixes race conditions that seem to be able to lead to this condition and adds better error handling for OS failures (moby/libnetwork#2146). However, after tripping over a deadlock in that region of the code recently (moby/libnetwork#2180), we have been wary of merging it without more thorough review. Hopefully we can come back to it soon.

(For reference, the fixes to createLoadBalancerSandbox() mentioned above are in PR review for Moby in #37372).
