-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix automatic removal of ingress sandbox when last service leaves #36538
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ctelfer thanks for also adding the integration-cli test for this tricky test scenario. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM |
Codecov Report
@@ Coverage Diff @@
## master #36538 +/- ##
=========================================
Coverage ? 34.72%
=========================================
Files ? 612
Lines ? 45403
Branches ? 0
=========================================
Hits ? 15765
Misses ? 27577
Partials ? 2061 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ctelfer SGTM, but can you reference the upstream PR on libnetwork
(we'll need to wait for it to get merged to be able to merge it here)
@vdemeester Thanks for the review. At the time I filed the PR, the libnetwork change wasn't merged. I've updated the vendoring now so it references the appropriate libnetwork commit directly. CI seems to be in progress. For reference, the libnetwork PR for this fix is moby/libnetwork#2097. Thanks again! |
@ctelfer looks like something went wrong with vendoring;
|
facepalm Updated the vendor.conf but forgot to actually run vndr. :( |
The commit moby#35422 had the result of accidentally causing the removal of the ingress network when the last member of a service left the network. This did not appear in swarm instances because the swarm manager would still maintain and return cluster state about the network even though it had removed its sandbox and endpoint. This test verifies that after a service gets added and removed that the ingress sandbox remains in a functional state. Signed-off-by: Chris Telfer <[email protected]>
This PR prevents automatic removal of the load balancing sandbox endpoint when the endpoint is the last one in the network but the network is marked as ingress. Signed-off-by: Chris Telfer <[email protected]>
Ingress networks will no longer automatically remove their load-balancing endpoint (and sandbox) automatically when the network is otherwise upopulated. This is to prevent automatic removal of the ingress networks when all the containers leave them. Therefore explicit removal of an ingress network also requires explicit removal of its load-balancing endpoint. Signed-off-by: Chris Telfer <[email protected]>
Looks like a test in the experimental tests failed. It appears to be an error in swarm teardown in TestAPISwarmRestartCluster. Is there a way to easily run just this test so I can try to reproduce it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🐅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, but left some questions/notes for possible follow-ups
@@ -222,6 +222,8 @@ func (daemon *Daemon) releaseIngress(id string) { | |||
return | |||
} | |||
|
|||
daemon.deleteLoadBalancerSandbox(n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was looking why this didn't return errors, but looks like all errors are only logged as warnings. Perhaps (not for this PR) we should rename that function to tryDeleteLoadBalancerSandbox()
;
Lines 505 to 534 in 0b0af85
func (daemon *Daemon) deleteLoadBalancerSandbox(n libnetwork.Network) { | |
controller := daemon.netController | |
//The only endpoint left should be the LB endpoint (nw.Name() + "-endpoint") | |
endpoints := n.Endpoints() | |
if len(endpoints) == 1 { | |
sandboxName := n.Name() + "-sbox" | |
info := endpoints[0].Info() | |
if info != nil { | |
sb := info.Sandbox() | |
if sb != nil { | |
if err := sb.DisableService(); err != nil { | |
logrus.Warnf("Failed to disable service on sandbox %s: %v", sandboxName, err) | |
//Ignore error and attempt to delete the load balancer endpoint | |
} | |
} | |
} | |
if err := endpoints[0].Delete(true); err != nil { | |
logrus.Warnf("Failed to delete endpoint %s (%s) in %s: %v", endpoints[0].Name(), endpoints[0].ID(), sandboxName, err) | |
//Ignore error and attempt to delete the sandbox. | |
} | |
if err := controller.SandboxDestroy(sandboxName); err != nil { | |
logrus.Warnf("Failed to delete %s sandbox: %v", sandboxName, err) | |
//Ignore error and attempt to delete the network. | |
} | |
} | |
} |
- What are the side-effects if deleting fails?
- Should that function fail if
len(endpoints) > 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very fair question and one discussed a bit in moby/libnetwork#2097. Ultimately, this function was copied into libnetwork about 6 months ago but the original never got removed. As it was, this was dead code. There is a function of the same name and behavior in libnetwork and it is not exported. It would be possible to just export said function (changing the libnetwork API, but in a small way). However, given that
1. I'm working on a patch in the near term to refactor of load balancing to improve scalability
2. This refactoring might affect how we expose said function
3. We are trying to get this regression fixed for the immanent release
It seemed more prudent to make the smaller change to fix the regression and then clean up the duplicate code in the larger refactorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks; yes it's not new code, so I'm good with keeping it for a follow-up. Just was looking at it, and it felt like a lot of error conditions were ignored, so left that comment for later investigation if needed. 👍
err = client.ServiceRemove(context.Background(), serviceID) | ||
require.NoError(t, err) | ||
|
||
poll.WaitOn(t, serviceIsRemoved(client, serviceID), pollSettings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but we should update testenv.ProtectAll()
and testenv.Clean()
to also cleanup services.
moby/internal/test/environment/protect.go
Lines 32 to 43 in 3e0299f
// ProtectAll protects the existing environment (containers, images, networks, | |
// volumes, and, on Linux, plugins) from being cleaned up at the end of test | |
// runs | |
func ProtectAll(t testingT, testEnv *Execution) { | |
ProtectContainers(t, testEnv) | |
ProtectImages(t, testEnv) | |
ProtectNetworks(t, testEnv) | |
ProtectVolumes(t, testEnv) | |
if testEnv.OSType == "linux" { | |
ProtectPlugins(t, testEnv) | |
} | |
} |
moby/internal/test/environment/clean.go
Lines 25 to 42 in 3e0299f
// Clean the environment, preserving protected objects (images, containers, ...) | |
// and removing everything else. It's meant to run after any tests so that they don't | |
// depend on each others. | |
func (e *Execution) Clean(t testingT) { | |
client := e.APIClient() | |
platform := e.OSType | |
if (platform != "windows") || (platform == "windows" && e.DaemonInfo.Isolation == "hyperv") { | |
unpauseAllContainers(t, client) | |
} | |
deleteAllContainers(t, client, e.protectedElements.containers) | |
deleteAllImages(t, client, e.protectedElements.images) | |
deleteAllVolumes(t, client, e.protectedElements.volumes) | |
deleteAllNetworks(t, client, platform, e.protectedElements.networks) | |
if platform == "linux" { | |
deleteAllPlugins(t, client, e.protectedElements.plugins) | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #36581 for that
poll.WaitOn(t, swarmIngressReady(client), pollSettings) | ||
|
||
var instances uint64 = 1 | ||
serviceName := "TestIngressService" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to get rid of names where possible (at least be sure they're unique (could use t.Name()
to make them slightly more unique)).
I see there's a couple of other locations already, so also for a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #36583 for that ^^
|
||
} | ||
|
||
const ingressNet = "ingress" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need a test with a custom ingress network?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair question. I'm not aware of any code that tags any other networks besides a swarm-mode network named "ingress" as an ingress network. Neither am I aware of our support for such a construct. However, that may just be my lack of experience w/ the product.
My use of the named constant was more out of a general purpose aversion to magic constants sprinkled through code. (i.e. collecting them to a named token makes refactoring easier and saner as a rule ... but will probably not be needed in this case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding "any code that tags any network besides swarm-mode network named "ingress" as an ingress network": you can create an overlay network named anything after removing existing ingress as long as it marked with --ingress
flag. E.g. (after deleting existing ingress) docker network create --ingress -d overlay testingress
works
Okay, think this should be ready to go; we can address the other points in follow-ups (as discussed above) |
Thanks for all the help! |
- What I did
Patched libnetwork to avoid automatically deleting ingress networks when the only endpoint remaining in the network was the ingress load balancer IP. Then adjust the moby code so that it explicitly removes this endpoint when explicit ingress removal is requested. The issue this fixes is detailed in moby/libnetwork#2097.
- How I did it
Two separate patches are required. The first is to libnetwork to avoid an issue introduced in moby/libnetwork#2011 that automatically removes any network (including ingress) when the number of endpoints drops to 1 and the network contains a load balancing IP. The patch prevents the removal if the network is marked as ingress. The second patch adds logic to the daemon.releaseIngress() call in moby to remove the ingress load balancing endpoint before invoking libnetwork.network.Delete().
- How to verify it
I added a test to the moby integration tests called TestServiceWithIngressNetwork. This test:
If one applies the test patch (the first in the series) without the other two fix patches, the test will fail noting ingress corruption. If one applies the entire PR set of patches, and runs the test, it will pass. (running the test as
TESTFLAGS='-test.run TestServiceWithIngressNetwork' make test-integration
)- Description for the changelog
Prevent implicit removal of the ingress network
- A picture of a cute animal (not mandatory but encouraged)