Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix automatic removal of ingress sandbox when last service leaves #36538

Merged
merged 3 commits into from
Mar 13, 2018

Conversation

ctelfer
Copy link
Contributor

@ctelfer ctelfer commented Mar 8, 2018

- What I did
Patched libnetwork to avoid automatically deleting ingress networks when the only endpoint remaining in the network was the ingress load balancer IP. Then adjust the moby code so that it explicitly removes this endpoint when explicit ingress removal is requested. The issue this fixes is detailed in moby/libnetwork#2097.

- How I did it
Two separate patches are required. The first is to libnetwork to avoid an issue introduced in moby/libnetwork#2011 that automatically removes any network (including ingress) when the number of endpoints drops to 1 and the network contains a load balancing IP. The patch prevents the removal if the network is marked as ingress. The second patch adds logic to the daemon.releaseIngress() call in moby to remove the ingress load balancing endpoint before invoking libnetwork.network.Delete().

- How to verify it
I added a test to the moby integration tests called TestServiceWithIngressNetwork. This test:

  • spins up a swarm
  • waits for the ingress network to "settle"
  • creates a simple service and waits for it to converge
  • removes the service
  • waits and then verifies that the ingress network is still present and is not corrupted

If one applies the test patch (the first in the series) without the other two fix patches, the test will fail noting ingress corruption. If one applies the entire PR set of patches, and runs the test, it will pass. (running the test as TESTFLAGS='-test.run TestServiceWithIngressNetwork' make test-integration)

- Description for the changelog
Prevent implicit removal of the ingress network

- A picture of a cute animal (not mandatory but encouraged)

Copy link
Contributor

@mavenugo mavenugo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mavenugo
Copy link
Contributor

mavenugo commented Mar 9, 2018

@ctelfer thanks for also adding the integration-cli test for this tricky test scenario.

Copy link
Contributor

@ddebroy ddebroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@fcrisciani fcrisciani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@selansen
Copy link
Contributor

selansen commented Mar 9, 2018

LGTM

@codecov
Copy link

codecov bot commented Mar 11, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@241c904). Click here to learn what that means.
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master   #36538   +/-   ##
=========================================
  Coverage          ?   34.72%           
=========================================
  Files             ?      612           
  Lines             ?    45403           
  Branches          ?        0           
=========================================
  Hits              ?    15765           
  Misses            ?    27577           
  Partials          ?     2061

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ctelfer SGTM, but can you reference the upstream PR on libnetwork (we'll need to wait for it to get merged to be able to merge it here)

@ctelfer
Copy link
Contributor Author

ctelfer commented Mar 12, 2018

@vdemeester Thanks for the review. At the time I filed the PR, the libnetwork change wasn't merged. I've updated the vendoring now so it references the appropriate libnetwork commit directly. CI seems to be in progress. For reference, the libnetwork PR for this fix is moby/libnetwork#2097. Thanks again!

@thaJeztah
Copy link
Member

@ctelfer looks like something went wrong with vendoring;

15:32:37 The result of vndr differs
15:32:37 
15:32:37  M vendor/github.com/docker/libnetwork/config/config.go
15:32:37  M vendor/github.com/docker/libnetwork/controller.go
15:32:37  M vendor/github.com/docker/libnetwork/drivers_ipam.go
15:32:37  M vendor/github.com/docker/libnetwork/ipams/builtin/builtin_unix.go
15:32:37  M vendor/github.com/docker/libnetwork/ipams/builtin/builtin_windows.go
15:32:37  M vendor/github.com/docker/libnetwork/ipamutils/utils.go
15:32:37  M vendor/github.com/docker/libnetwork/network.go
15:32:37  M vendor/github.com/docker/libnetwork/service_common.go
15:32:37  M vendor/github.com/docker/libnetwork/vendor.conf
15:32:37 
15:32:37 Please vendor your package with github.com/LK4D4/vndr.

@ctelfer
Copy link
Contributor Author

ctelfer commented Mar 12, 2018

facepalm Updated the vendor.conf but forgot to actually run vndr. :(

ctelfer added 3 commits March 12, 2018 15:19
The commit moby#35422 had the result of
accidentally causing the removal of the ingress network when the
last member of a service left the network.  This did not appear
in swarm instances because the swarm manager would still maintain
and return cluster state about the network even though it had
removed its sandbox and endpoint.  This test verifies that after a
service gets added and removed that the ingress sandbox remains
in a functional state.

Signed-off-by: Chris Telfer <[email protected]>
This PR prevents automatic removal of the load balancing sandbox
endpoint when the endpoint is the last one in the network but
the network is marked as ingress.

Signed-off-by: Chris Telfer <[email protected]>
Ingress networks will no longer automatically remove their
load-balancing endpoint (and sandbox) automatically when the network is
otherwise upopulated.   This is to prevent automatic removal of the
ingress networks when all the containers leave them.  Therefore
explicit removal of an ingress network also requires explicit removal
of its load-balancing endpoint.

Signed-off-by: Chris Telfer <[email protected]>
@ctelfer
Copy link
Contributor Author

ctelfer commented Mar 13, 2018

Looks like a test in the experimental tests failed. It appears to be an error in swarm teardown in TestAPISwarmRestartCluster. Is there a way to easily run just this test so I can try to reproduce it?

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐅

@thaJeztah
Copy link
Member

thaJeztah commented Mar 13, 2018

Full diff of libnetwork changes being brought in: moby/libnetwork@ed2130d...3aca383

Which includes:

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, but left some questions/notes for possible follow-ups

@@ -222,6 +222,8 @@ func (daemon *Daemon) releaseIngress(id string) {
return
}

daemon.deleteLoadBalancerSandbox(n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was looking why this didn't return errors, but looks like all errors are only logged as warnings. Perhaps (not for this PR) we should rename that function to tryDeleteLoadBalancerSandbox();

moby/daemon/network.go

Lines 505 to 534 in 0b0af85

func (daemon *Daemon) deleteLoadBalancerSandbox(n libnetwork.Network) {
controller := daemon.netController
//The only endpoint left should be the LB endpoint (nw.Name() + "-endpoint")
endpoints := n.Endpoints()
if len(endpoints) == 1 {
sandboxName := n.Name() + "-sbox"
info := endpoints[0].Info()
if info != nil {
sb := info.Sandbox()
if sb != nil {
if err := sb.DisableService(); err != nil {
logrus.Warnf("Failed to disable service on sandbox %s: %v", sandboxName, err)
//Ignore error and attempt to delete the load balancer endpoint
}
}
}
if err := endpoints[0].Delete(true); err != nil {
logrus.Warnf("Failed to delete endpoint %s (%s) in %s: %v", endpoints[0].Name(), endpoints[0].ID(), sandboxName, err)
//Ignore error and attempt to delete the sandbox.
}
if err := controller.SandboxDestroy(sandboxName); err != nil {
logrus.Warnf("Failed to delete %s sandbox: %v", sandboxName, err)
//Ignore error and attempt to delete the network.
}
}
}

  • What are the side-effects if deleting fails?
  • Should that function fail if len(endpoints) > 1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very fair question and one discussed a bit in moby/libnetwork#2097. Ultimately, this function was copied into libnetwork about 6 months ago but the original never got removed. As it was, this was dead code. There is a function of the same name and behavior in libnetwork and it is not exported. It would be possible to just export said function (changing the libnetwork API, but in a small way). However, given that
1. I'm working on a patch in the near term to refactor of load balancing to improve scalability
2. This refactoring might affect how we expose said function
3. We are trying to get this regression fixed for the immanent release
It seemed more prudent to make the smaller change to fix the regression and then clean up the duplicate code in the larger refactorization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; yes it's not new code, so I'm good with keeping it for a follow-up. Just was looking at it, and it felt like a lot of error conditions were ignored, so left that comment for later investigation if needed. 👍

err = client.ServiceRemove(context.Background(), serviceID)
require.NoError(t, err)

poll.WaitOn(t, serviceIsRemoved(client, serviceID), pollSettings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but we should update testenv.ProtectAll() and testenv.Clean() to also cleanup services.

// ProtectAll protects the existing environment (containers, images, networks,
// volumes, and, on Linux, plugins) from being cleaned up at the end of test
// runs
func ProtectAll(t testingT, testEnv *Execution) {
ProtectContainers(t, testEnv)
ProtectImages(t, testEnv)
ProtectNetworks(t, testEnv)
ProtectVolumes(t, testEnv)
if testEnv.OSType == "linux" {
ProtectPlugins(t, testEnv)
}
}

// Clean the environment, preserving protected objects (images, containers, ...)
// and removing everything else. It's meant to run after any tests so that they don't
// depend on each others.
func (e *Execution) Clean(t testingT) {
client := e.APIClient()
platform := e.OSType
if (platform != "windows") || (platform == "windows" && e.DaemonInfo.Isolation == "hyperv") {
unpauseAllContainers(t, client)
}
deleteAllContainers(t, client, e.protectedElements.containers)
deleteAllImages(t, client, e.protectedElements.images)
deleteAllVolumes(t, client, e.protectedElements.volumes)
deleteAllNetworks(t, client, platform, e.protectedElements.networks)
if platform == "linux" {
deleteAllPlugins(t, client, e.protectedElements.plugins)
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #36581 for that

poll.WaitOn(t, swarmIngressReady(client), pollSettings)

var instances uint64 = 1
serviceName := "TestIngressService"
Copy link
Member

@thaJeztah thaJeztah Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to get rid of names where possible (at least be sure they're unique (could use t.Name() to make them slightly more unique)).

I see there's a couple of other locations already, so also for a follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #36583 for that ^^


}

const ingressNet = "ingress"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need a test with a custom ingress network?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair question. I'm not aware of any code that tags any other networks besides a swarm-mode network named "ingress" as an ingress network. Neither am I aware of our support for such a construct. However, that may just be my lack of experience w/ the product.

My use of the named constant was more out of a general purpose aversion to magic constants sprinkled through code. (i.e. collecting them to a named token makes refactoring easier and saner as a rule ... but will probably not be needed in this case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding "any code that tags any network besides swarm-mode network named "ingress" as an ingress network": you can create an overlay network named anything after removing existing ingress as long as it marked with --ingress flag. E.g. (after deleting existing ingress) docker network create --ingress -d overlay testingress works

@thaJeztah
Copy link
Member

Okay, think this should be ready to go; we can address the other points in follow-ups (as discussed above)

@ctelfer
Copy link
Contributor Author

ctelfer commented Mar 13, 2018

Thanks for all the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants