Allocate IPv6 addresses after detecting IPv6 support#47406
Allocate IPv6 addresses after detecting IPv6 support#47406robmry merged 7 commits intomoby:masterfrom
Conversation
9f40892 to
9b7b3c0
Compare
9b7b3c0 to
4a26d99
Compare
4a26d99 to
c37c9d4
Compare
157a62b to
ee61361
Compare
corhere
left a comment
There was a problem hiding this comment.
If you are going to insist that IPAM must continue to happen when the endpoint is created, I would rather go back to the way you had it before where the daemon inspected the sandbox to decide whether or not to enable IPv6 on the endpoint. What you have currently is the same thing with many extra steps and lots of complexity without any upside.
Why is deferring IPAM allocation to Join/Restore infeasible and what do manually-assigned addresses have to do with it?
ee61361 to
ec2be5b
Compare
ec2be5b to
e478178
Compare
|
I've un-merged CreateEndpoint/Endpoint.Join, so that the Sandbox can be configured for legacy links between the two. CreateEndpoint still needs to know whether to set up IPv6 addresses, so I've added CreateEndpointForSandbox that takes the Sandbox as a param. Once we get rid of legacy links perhaps we'll make it do the Join as well, but there's no need to do that as part of this change. Moving address assignment to Join would mean CreateEndpoint doesn't need to see the Sandbox, but it wouldn't help with legacy links ... Join would need a callback after allocating addresses, before doing the Join. The final commit isn't intended to be merged in its current state, if we want to do it or something like it, the changes will need to be pushed down into the other commits - it's just an experiment/proposal for removing references to legacy-links from the libnetwork code, while making the Sandbox update for legacy links type safe.
That sounds good. But, the Sandbox options to implement legacy links are very legacy-linky, not generic things that we'll want to leave behind for other purposes when we drop support for legacy links. For example ... Even So, I don't think removing references to legacy links gives us anything but obfuscation. But, I know there are strong feelings on the subject - so, can do something like the change in the final commit? Or, is there a better approach? |
e478178 to
669cd2e
Compare
|
Rebased to resolve merge conflicts and run up-to-date tests. |
72c93c0 to
6e19ae6
Compare
6e19ae6 to
57b26b9
Compare
|
Rebased again (new conflict in |
| return nil | ||
| } | ||
| for _, child := range children { | ||
| if !isLinkable(child) { |
There was a problem hiding this comment.
For a follow-up; now that this code is in a platform-specific file, we should inline this function (makes it clearer what it does);
moby/daemon/container_operations_unix.go
Lines 403 to 407 in 9eeec5f
And remove the windows stub;
moby/daemon/container_operations_windows.go
Lines 153 to 155 in 9eeec5f
| return errors.Wrapf(err, "failed to add IPv6 address to /etc/hosts for link to %s", child.Name) | ||
| } | ||
| } | ||
| cEndpointID = defaultNW.EndpointID |
There was a problem hiding this comment.
It's existing code, but this is confusing; i.e., we overwrite cEndpointID in each iteration, and cEndpointID is used elsewhere.
There was a problem hiding this comment.
Oh! I see Cory commented about that as well; 68a4761#r1526578113
There was a problem hiding this comment.
That link doesn't work - but I think it's #47406 (comment)
In any case, I think it's something for follow-up as it's not changed by this PR - which is already big enough!
daemon/container_operations.go
Outdated
| func (daemon *Daemon) initializeNetworking( | ||
| ctx context.Context, | ||
| cfg *config.Config, | ||
| ctr *container.Container, | ||
| ) (newSandbox *libnetwork.Sandbox, retErr error) { |
There was a problem hiding this comment.
Slightly tempted to keep this unwrapped; yes, it's a long line, but most of it can be skimped over, knowing that it's just the func's signature.
There was a problem hiding this comment.
Done. Is the rule "no line-wrapping in function declarations"?
(To me, it's just slightly harder to read and see the return types. And diffs won't pick out future changes as clearly.)
There was a problem hiding this comment.
Most of the codebase is written that way, but I admit I'm not a big fan of that. To @robmry's point, it's harder to read IMHO.
There was a problem hiding this comment.
Tempted to say; if the list of arguments for a function is becoming so long that it needs wrapping, that's a very good indicator that the signature may be in need of an overhaul 😄
There was a problem hiding this comment.
Two params and a return value plus context and error, seems ok? So we need shorter names?
3640838 to
bb99866
Compare
thaJeztah
left a comment
There was a problem hiding this comment.
Did another pass; overall looking good. Perhaps ready to go, but more eyes won't hurt on this one!
daemon/container_operations.go
Outdated
| if err := daemon.updateNetworkSettings(ctr, n, endpointConfig); err != nil { | ||
| return err | ||
| } | ||
| return nil |
There was a problem hiding this comment.
nit (but there may be some linters complaining);
return daemon.updateNetworkSettings(ctr, n, endpointConfig)| Config: &containertypes.Config{ | ||
| Hostname: "baz", | ||
| }, | ||
| HostConfig: &containertypes.HostConfig{}, |
There was a problem hiding this comment.
Curious (no need to change); did this panic somewhere? (in that case; are we missing a check somewhere to prevent that)?
There was a problem hiding this comment.
Yes ...
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x17aaf44]
goroutine 27 [running]:
testing.tRunner.func1.2({0x1aa5ce0, 0x3171920})
/usr/local/go/src/testing/testing.go:1631 +0x1c4
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1634 +0x33c
panic({0x1aa5ce0?, 0x3171920?})
/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/docker/docker/daemon.(*Daemon).updateNetworkSettings(0x400051dd00, 0x400051dab8, 0x40000504e0, 0x40001a5b20)
/go/src/github.com/docker/docker/daemon/container_operations.go:170 +0x224
github.com/docker/docker/daemon.(*Daemon).updateNetworkConfig(0x400051dd00, 0x400051dab8, 0x40000504e0, 0x40001a5b20)
/go/src/github.com/docker/docker/daemon/container_operations.go:631 +0x2ac
github.com/docker/docker/daemon.TestDNSNamesOrder(0x4000563d40)
/go/src/github.com/docker/docker/daemon/container_operations_test.go:35 +0x264
testing.tRunner(0x4000563d40, 0x1ee8270)
/usr/local/go/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 1
/usr/local/go/src/testing/testing.go:1742 +0x318
Which comes from:
moby/daemon/container_operations.go
Line 170 in 0cba410
Looks like it's normally initialised in daemon.newContainer - and there's lots of code that assumes it exists:
Line 151 in afdfe4f
| sb.config.extraHosts = append(sb.config.extraHosts, extraHost{name: name, IP: ip}) | ||
| return sb.rebuildHostsFile(ctx) |
There was a problem hiding this comment.
Not entirely new (I think), but it caught my eye now;
This logic;
- Appends extrahosts
- Calls "rebuild" for the hosts file
But looking at the "rebuild";
- It generates a new host-file (to replace the existing one?) with the new config (which includes the new "extra hosts")
- After that, it updates the file for each endpoint
That last part is done in a loop, and for each iteration it adds an endpoint, and (from a quick look), locks the hosts-file and writes it to disk
I'm wondering;
- Could we could we skip the "updating" part if we would update the config before we call
sb.buildHostsFile()? - (or have
sb.buildHostsFile()accept the list of "additional hosts" to add) - ☝️ so that it has all the data it needs, and just has to construct (and write) once
func (sb *Sandbox) rebuildHostsFile(ctx context.Context) error {
if err := sb.buildHostsFile(); err != nil {
return errdefs.System(err)
}
for _, ep := range sb.Endpoints() {
if err := sb.updateHostsFile(ctx, ep.getEtcHostsAddrs()); err != nil {
return errdefs.System(err)
}
}
return nil
}If that's not an option, I noticed that sb.updateHostsFile() accepts a slice of addresses; so instead of updating it one at a time, would that mean we could just collect the endpoints, and pass it? Something like;
eps := sb.Endpoints()
addrs := make([]string, 0, len(eps))
for _, ep := range eps {
addrs = append(addrs, ep.getEtcHostsAddrs()...)
if err := sb.updateHostsFile(ctx, ep.getEtcHostsAddrs()); err != nil {
return errdefs.System(err)
}
}
if err := sb.updateHostsFile(ctx, addrs...); err != nil {
// NOTE: should sb.updateHostsFile already return a proper errdefs type (errdefs.System)?
return errdefs.System(err)
}
return nilThere was a problem hiding this comment.
Sandbox.AddHostsEntry is only used for legacy --link options for containers on the default bridge network, before they're started - and those containers are only likely to be connected to the default bridge (only one Endpoint) ... so I don't think there's much to gain by optimising it.
Sandbox.rebuildHostsFile is called from SetKey for a newly created container once the its support for IPv6 is known. That container could be connected to multiple endpoints - so maybe there are some msecs to be gained by accumulating addresses from all of the endpoints. But, optimising the whole etchosts thing so that it doesn't keep writing/reading/merging the contents (before a container starts and the user's had a chance to modify it), is probably the way to go if that's an issue.
These /etc/hosts entries are pretty questionable anyway for user-defined networks that have an internal DNS resolver. If we figure out a fix for buildkit, so that we can use the internal resolver for the default network too, even more so.
I don't really want to muck about with it in this PR, maybe as a follow-up if needed?
| } | ||
|
|
||
| return nil, ipv6Miss | ||
| // this map is to avoid IP duplicates, this can happen during a transition period where 2 services are using the same IP |
There was a problem hiding this comment.
Mostly curious; is there any order in the results? In this case, should "last" or "first" take precedence? (currently the first one is used, and the last one is discarded; wondering if we should loop in reverse).
There was a problem hiding this comment.
This PR only changes whitespace here ... but no, I don't think the order can matter - the results come from a map in the first place, and DNS resolvers are free to rotate addresses in responses anyway to balance load.
|
|
||
| return nil, ipv6Miss | ||
| // this map is to avoid IP duplicates, this can happen during a transition period where 2 services are using the same IP | ||
| noDup := make(map[string]bool) |
There was a problem hiding this comment.
Not relevant for this PR, but this could uses a map[string]struct{} to avoid allocations.
bb99866 to
1814519
Compare
Second attempt to stop using the OCI prestart hook to call SetKey to set up the OS Sandbox's key and perform network config in the new network namespace. The first attempt was reverted because it made it impossible to use --sysctl to set per-interface sysctls on an interface that had not yet been moved into the new network namespace. Now, per-interface sysctls can be used to do that (with less ambiguity because the setting is not tied to the interface using an unpredictably assigned name). Signed-off-by: Rob Murray <[email protected]>
When connecting a container to a new network, its NetworkSettings were unconditionally updated. But, when creating a new container, they were only updated if there were no NetworkSettings before a network was connected. But, that's always the case - so, make the update unconditionally. Signed-off-by: Rob Murray <[email protected]>
If config for legacy links needs to be added to a libnetwork.Sandbox, add it when constructing the Endpoint that needs it - removing the constraint on ordering of Endpoint construction, and the dependency between Endpoint and Sandbox construction. So, now a Sandbox can be constructed in one place, before the first Endpoint. Signed-off-by: Rob Murray <[email protected]>
Signed-off-by: Rob Murray <[email protected]>
For Linux, delay construction and configuration of network endpoints until the container has been created (but not started). Signed-off-by: Rob Murray <[email protected]>
When a container doesn't support IPv6 and it's joined to an IPv6 network, don't allocate an IPv6 address for it. Update the DNS resolver to understand that it can have an 'ipv6miss' (meaning an IPv4 address exists, but no IPv6) when a network is IPv6 enabled. Signed-off-by: Rob Murray <[email protected]>
Interface DNSBackend.ResolveName, implemented by Network, Sandbox (and noopDNSBackend) had a bool return value that meant 'ipv6Miss'. But, it was always set to true on a hit, and callers had to deal with that. So, changed the meaning of the return value to indicate whether the name was found - which will also work for 'ipv4Miss' when we have IPv6-only containers/networks. Signed-off-by: Rob Murray <[email protected]>
1814519 to
caf2d5d
Compare
- What I did
If IPv6 is disabled for a container, do not allocate an IPv6 address when it's attached to an IPv6 network.
- How I did it
There are a few commits - it's probably easiest to review them separately ...
Unconditionally update container.NetworkSettings
In
daemon.allocateNetwork(), flagupdateSettingswas set ifcontainer.NetworkSettings.Networkswas empty. That flag was passed down through the call stack to avoid a call todaemon.updateNetworkSettings()indaemon.updateNetworkConfig().Other calls to
daemon.updateNetworkSettings(), which happen when connecting an existing container to another network, unconditionally made the NetworkSettings update.I don't think there's a circumstance where NetworkSettings is not empty during container creation. Even if there is, the update is harmless and cheap. Eliminating the
updateSettingsflag simplifies the code a little and, in the next commit whereallocateNetwork()is split out ofinitializeNetworking(), it avoids the need to pass that flag between the two.Separate Sandbox/Endpoint construction
Previously, the
libnetwork.Sandboxwas created indaemon.allocateNetwork()if there was no network - otherwise it happened while connecting an Endpoint indaemon.connectToNetwork(). If there was an endpoint in the default bridge, it had to be connected first, because extra configuration is needed in the Sandbox for legacy links and that could only happen during Sandbox construction.Now, config for legacy links is added to the Sandbox when constructing the Endpoint that needs it - removing the constraint on ordering of Endpoint construction, and the dependency between Endpoint and Sandbox construction.
So, now a Sandbox can be constructed in one place, before the first Endpoint.
Also, replaces some legacy-link specific Sandbox option-setters with Sandbox methods for updating
/etc/hosts, and updates a legacy-link parent's IPv6 address as well as its IPv4 address when a child container restarts.Configure network endpoints after creating a container
This commit splits the bulk of
daemon.allocateNetwork()out ofdaemon.initializeNetworking().daemon.initializeNetworking()hasn't moved, but it now only prepares configuration and doesn't do any Endpoint construction or configuration. Sandbox construction still happens here.Endpoint construction is in
daemon.allocateNetwork()which, on non-Windows, is called after the container task has been created (before it's started).On Windows, Endpoint construction still happens before the container task is created. If it's created afterwards, some DNS lookups for the container don't seem to end up in our resolver - see the TODO comment in the code. (But, Windows can't benefit from the delayed assignment of IPv6 addresses anyway.)
Doing the Endpoint construction after the
osSboxhas been set up makes it possible to probe the container's network for IPv6 support first - enabling the next commit ...Only assign IPv6 addresses if required
If a Sandbox is provided to
CreateEndpointForSandbox, check whether it has IPv6 before allocating IPv6 addresses.Because no IPv6 address is added to the DNS when the Sandbox doesn't support it, a lookup for a container on an IPv6 network can now have no IPv6 address. The resolver is updated to treat a missing IPv6 address on an IPv4 hit as
ipv6misseven if the network has IPv6 enabled. (So, an empty DNS response is returned, avoiding the upstream DNS request and NXDOMAIN.)The
macvlandriver expected an endpoint on a network with IPv6 subnets to have an IPv6 address - it searched the subnets for the address assigned to the endpoint, to work out which gateway address to use (crashing if there was no address). Now, it only makes that search if there is an IPv6 address.- How to verify it
Existing regression tests for the refactoring.
New integration test checks that a container with IPv6 disabled via
sysctlon an IPv6 network has no IPv6 address.- Description for the changelog
If IPv6 is disabled for a container, do not allocate an IPv6 address when it's attached to an IPv6 network.