-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
services do not start: ingress-sbox is already present #36743
Comments
@selansen can you please take a look ? |
Sure . will do . |
I tried to reproduce this issue locally . I was successful but I dont see the error message you had mentioned. I see error like below I will continue to debug this issue. |
The above failure is coz I run the test on my MAC. Looks like other test cases also fails on my MAC. I am trying on AWS VM and test is passing. Will see if I can reproduce it my my VM. |
I tried to reproduce it on my local VM but it passes every time. I tried single test case and also tried set of test cases hoping this issue might have occurred due to cleanup issue. |
Actually, from a look at the logs cited above, the ingress creation may be a symptom of another problem. I note that the ingress network creation fails with the following error:
(https://gist.github.com/tonistiigi/6cb35ce9947cdbbcf410cf9863945220#file-docker-log-L283) You may note that this occurs shortly before the 'already present' message. It looks like the node tries to join the network, fails due the VXLAN issue, starts to tear the network down and then tries to rejoin. I can't tell yet whether the rejoin is failing due to a race with the teardown or due to an error where the teardown isn't cleaning up properly. Regardless, there are 2 different libnetwork bugs that we happen to be paying attention to right now that relate to the Vxlan creation failure. They are moby/libnetwork#1765 and moby/libnetwork#1800. Both have PRs that are being actively reviewed. Unfortunately, we don't really have a root cause for the former. While the PR there definitely fixes issues, it isn't certain it fixes the issue causing the bug. The main difficulty is with reproducibility as @selansen is experiencing. |
To be clear, even though we will address the vxlan issues, we also need to try to root cause why the teardown/setup was causing the "ingress-sbox already present" problem. |
I noticed that error message too. I feel we should merge 1800 and 1765 regardless of this issue. |
Ok, after some review with @selansen I think we understand the symptoms. The cause of the failure is bug moby/libnetwork#1765. However, this bug triggers a known secondary bug in libnetwork in the network.createLoadBalancerSandbox() call. It doesn't have a PR because I found it while working on improving the load balancing scalability and included the fix in that PR. Here is the fix (does have its own commit): moby/libnetwork@4543e14 . The issue is caused by libnetwork trying to unwind a failure using defer functions that check whether an
As you can see, there is already a fix for the second bug but it is not yet merged. However, even if it were, all it does is hide the secondary symptom. The real problem is in moby/libnetwork#1765 for which we have not yet found a root cause (although we have found some potential ones: (moby/libnetwork#2143)). Nevertheless, if desired, I can try to expedite the patch for createLoadBalancerSandbox(). |
I have run CI locally from Friday to till now. Not even single time I came across Service create failure issue. |
Changes included: - libnetwork#2147 Adding logs for ipam state - libnetwork#2143 Fix race conditions in the overlay network driver - possibly addresses moby#36743 services do not start: ingress-sbox is already present - possibly addresses moby#30427 Flaky Test: TestSwarmPublishDuplicatePorts on s390 - possibly addresses moby#36501 Flaky tests: Service "port" tests - libnetwork#2142 Add wait time into xtables lock warning - libnetwork#2135 filter xtables lock warnings when firewalld is active - libnetwork#2140 Switch from x/net/context to context - libnetwork#2134 Adding a recovery mechanism for a split gossip cluster Signed-off-by: Sebastiaan van Stijn <[email protected]>
I've just upgraded from docker-ce 17.04 to 18.05 on a linux box. Has this problem. Restarting docker daemon, reinitializing swarm, removing networks, nothing helped. Only after rebooting physical machine the problem was gone. |
Sorry to hear that. :( I have a PR outstanding that fixes race conditions that seem to be able to lead to this condition and adds better error handling for OS failures (moby/libnetwork#2146). However, due tripping over a deadlock in that region of the code recently (moby/libnetwork#2180), I/we have been wary of merging without more thorough review. Hopefully we can come back to it soon. (For reference, the fixes to createLoadBalancerSandbox() mentioned above are in PR review for Moby in #37372). |
When creating services with port options, in some cases the services do not start. In the logs it is visible that the task receives an error on startup: "container ingress-sbox is already present in sandbox ingress_sbox". Then the manager will keep retrying but always retrieves the same error.
One place it consistently shows is Moby CI. Both
TestAPIServiceUpdatePort
andTestSwarmPublishDuplicatePorts
have failed ~70 times in last 2 months (full stats and links to failures: https://gist.githubusercontent.com/tonistiigi/9e91cbb968dea83113d09ce5f9dbff79/raw/17696e6fc2cce473d101f70ce6bdeb75a4f35bfd/moby_master_stats)Examples:
https://jenkins.dockerproject.org/job/Docker%20Master/label=ubuntu-1604-aufs-stable/9755/console
Daemon logs: https://gist.github.com/tonistiigi/6cb35ce9947cdbbcf410cf9863945220#file-docker-log-L317
https://jenkins.dockerproject.org/job/Docker%20Master/label=ubuntu-1604-aufs-stable/9755/console
Daemon logs https://gist.github.com/tonistiigi/92a587084a19e388ac96ec5b096a4dd2#file-gistfile1-txt-L927
Reported previously in #30427 #36501
@fcrisciani
The text was updated successfully, but these errors were encountered: