Create docker-proxy TCP/UDP listener sockets in the daemon by robmry · Pull Request #48132 · moby/moby

robmry · 2024-07-04T11:30:20Z

- What I did

Before commit 4f09af6, when allocating host ports for a new port mapping iptables rules were set up, then docker-proxy was started. If the host port was already in-use, docker-proxy exited with an error, and the iptables rules were removed. That could potentially interfere with a non-docker service that was already using the host port for something unrelated.

Commit 4f09af6 swapped that problem for a different one... in order to check that a port was available before creating iptables rules, it attempted to start docker-proxy first. If it failed, it could then try a different host port, without interfering with any other service. The problem with that is docker-proxy would start listening before the iptables rules were in place, so it could accept connections then become unusable because new NAT rules diverted packets directly to the container. This would leave the client with a broken connection, causing at-least a delay while it figured that out and reconnected.

- How I did it

This change creates and binds the socket in the daemon, before creating iptables rules. If the bind fails, it may try a different port. When or if the bind succeeds, iptables rules are created, then the daemon calls listen on the socket. If docker-proxy is needed, the socket is handed over to it at that point.

If docker-proxy is not needed (--userland-proxy=false), the daemon still listens on TCP sockets as the dummyProxy would have done. This makes the socket show up in netstat output.

The dummyProxy is no longer needed on Linux. Its job was to bind the host ports if docker-proxy was disabled, but that's now already handled by binding the sockets early.

This change doesn't affect SCTP, because it's not currently possible for docker-proxy to convert the file descriptor into an SCTPListener. So, docker-proxy is still started early, and the window for lost connections remains.

A new option has been added to docker-proxy, -use-listen-fd to tell it that fd 4 is a listener socket. It no longer needs to know the "frontend" address/port for TCP/UDP, but it's useful to see the options in the command line to work out which docker-proxy is which. (As a follow-up, it'd be possible to read the address/port from the fd and display it in the process title.)

If the user has an old docker-proxy in their path it'll fail for TCP/UDP, because it won't recognise the new command line option that says an listen-fd has been supplied as fd 4. In this case, the daemon's error message suggests checking $PATH.

RootlessKit used a proxy for docker-proxy called rootlesskit-docker-proxy. Its job was to make RootlessKit API calls to tell it about mapped host ports when necessary, before starting the usual docker-proxy, and then to remove that mapping when stopping the proxy. The rootlesskit-docker-proxy didn't know about new option -use-listen-fd, or the open sockets passed-in. Rather than update rootlesskit-docker-proxy, replace it by making the RootlessKit API calls directly from the daemon. That'll make it possible to continue to evolve docker-proxy, without having to keep rootlesskit-docker-proxy in-step (for example, we may go to a single docker-proxy per container rather than per port-mapping). It also makes it possible to set up those mappings in RootlessKit without running the proxy, when --userland-proxy=false. Because rootlesskit-docker-proxy is no longer used, it's no longer built.

- How to verify it

Updated tests.

The port mapping tests relied on startProxy being called to check that expected ports had been selected. That call is not made now, becausedummyProxy isn't used. So, changed most of the tests to make them start the proxy.

For tests of behaviour where a host port is unexpectedly unavailable, a real bind() now needs to fail in the daemon code being tested (the test's mock proxy can't report that it would have failed, because it's called much later in the process). So, the tests now set up a dummy bridge in their n/w namespace, add any host addresses necessary to that bridge, and bind ports that need to be busy - before the test starts.

With an old docker-proxy in the path, the error looks like this ...

docker: Error response from daemon: driver failed programming external connectivity on endpoint fervent_brattain (b4a9c5ebe14946f70c12c304da95abf41ed8b804903879c1163030ddbdbf1273): failed to start userland proxy for port mapping 0.0.0.0:8080:172.21.0.2:80/tcp: failed to start docker-proxy, check that the current version is in your $PATH.

(I hacked a NewSCTPListener(fd) into the sctp package, and checked that it worked with the socket bound in the daemon. So we can propose a change like that, then align SCTP socket creation with TCP/UDP.)

Rootless

Mapping from unspecified host address ...

$ docker run --rm -ti --network n1 -p 8080:80 nginx
$ ps -ef | grep docker-proxy
robm       39815   39165  0 18:37 ?        00:00:00 /home/robm/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 8080 -container-ip 172.18.0.2 -container-port 80
robm       39822   39165  0 18:37 ?        00:00:00 /home/robm/bin/docker-proxy -proto tcp -host-ip ::1 -host-port 8080 -container-ip 172.18.0.2 -container-port 80

These work ...
curl http://192.168.175.43:8080 # host ip
curl http://127.0.0.1:8080
curl http://[::1]:8080

Mapping from a specific host address ...

$ docker run --rm -ti --network n1 -p 192.168.175.43:8080:80 nginx
$ ps -ef | grep docker-proxy
robm       39978   39165  0 18:39 ?        00:00:00 /home/robm/bin/docker-proxy -proto tcp -host-ip 127.0.0.1 -host-port 8080 -container-ip 172.18.0.2 -container-port 80

Mapping from two specific host addresses, before the change ...

$ docker run --rm -ti --network n1 -p 192.168.175.43:8080:80 -p 127.0.0.1:8080:80 nginx
<long delay, the 16s timeout on trying to start rootlesskit-docker-proxy>
docker: Error response from daemon: driver failed programming external connectivity on endpoint admiring_williamson (b3d757bf1d0832fcc85eccd9692c72fb24ba2ec817495e977824516cbfbdcda2): failed to bind port 192.168.175.43:8080/tcp: Timed out proxy starting the userland proxy.

<then subsequent valid requests for the same mapped port fail with an error from RootlessKit>
$ docker run --rm -ti --network n1 -p 192.168.175.43:8080:80 nginx
docker: Error response from daemon: driver failed programming external connectivity on endpoint xenodochial_taussig (24118b2865fa4fc3f314cf172c21d1118e3a28aa0cccce82d33ceff002eaf3a9): failed to bind port 192.168.175.43:8080/tcp: Error starting userland proxy: error while calling PortManager.AddPort(): conflict with ID 2.

<need to restart rootlesskit>

After the change - binding fails immediately, because both mappings end up trying to use the same child address ...

$ docker run --rm -ti --network n1 -p 192.168.175.43:8080:80 -p 127.0.0.1:8080:80 nginx
docker: Error response from daemon: driver failed programming external connectivity on endpoint charming_johnson (1d749d940f091a93108d8364ae658a9c8b44b57b362963cfc18f5e01cf904f25): failed to bind port 192.168.175.43:8080/tcp: Error starting userland proxy: failed to listen on 127.0.0.1:8080: listen tcp4 127.0.0.1:8080: bind: address already in use.

With --userland-proxy=false ...

$ docker run --rm -ti --network n1 -p 8080:80 nginx
-> no docker-proxy process
-> curl works for http://127.0.0.1:8080, http://192.168.175.43:8080 - but not http://[::1]:8080

$ docker run --rm -ti --network n1 -p 192.168.175.43:8080:80 nginx
-> no docker-proxy process
-> curl works for http://192.168.175.43:8080

- Description for the changelog

- Close a window in which `docker-proxy` could accept TCP connections, which would fail after NAT rules were set up. The `docker-proxy` binary has been updated, the old version will not work with the updated `dockerd`.
- The executable `rootlesskit-docker-proxy` is no longer used, it has been removed from the build.

corhere

I have a suggestion: go all Single Responsibility Principle on the proxy implementations and make it the caller's responsibility to construct the net.Listener. https://github.com/robmry/moby/compare/bind_socket_for_docker_proxy...corhere:moby:bind_socket_for_docker_proxy?expand=1

I think docker-proxy should be backwards compatible with older dockerd. It's easy enough to do (if my suggestion is followed...) and could save users some grief if they accidentally mix versions of the binaries.

robmry · 2024-07-08T18:16:00Z

I have a suggestion: go all Single Responsibility Principle on the proxy implementations and make it the caller's responsibility to construct the net.Listener. https://github.com/robmry/moby/compare/bind_socket_for_docker_proxy...corhere:moby:bind_socket_for_docker_proxy?expand=1

I think docker-proxy should be backwards compatible with older dockerd. It's easy enough to do (if my suggestion is followed...) and could save users some grief if they accidentally mix versions of the binaries.

Thanks Cory - I picked up your changes and rebased/reworked a bit.

As discussed, before main() Go, at-least on Linux, has used fd 4 (5, 6, for and epoll and a pipe). So, I added a command line option to tell docker-proxy that the listen fd is present. An old docker-proxy won't understand that flag, it just exits with no response on the pipe to the daemon ... so, I used that to suggest checking $PATH. If used with an old daemon, the new docker-proxy will open sockets as it did before.

robmry · 2024-07-09T14:44:39Z

The rootless tests are failing because, it turns out, there's a rootlesskit-docker-proxy (a docker-proxy proxy) - and it doesn't know about the new -use-listen-fd option.

I think the tests passed before the last set of updates added that option, even though dockerd would already have bound the ports in the namespace it's using. Maybe dockerd needs to know if it's rootless and not-bind the ports, or just not use the new option to say it has. Or, something. I'll investigate ...

corhere · 2024-07-09T15:41:15Z

rootlesskit-docker-proxy (a docker-proxy proxy)

That smells a lot like a port-forwarding "driver" for rootless Docker. I wonder if it would be worthwhile to move the rootlesskit client stuff into the dockerd side so rootlesskit-docker-proxy can go away. Dockerd is already integrating with rootlesskit in other respects, and getting rootlesskit-docker-proxy out of the way would make it that much more practical to tackle e.g. implementing proxy-per-container multiplexing.

robmry · 2024-07-10T17:24:12Z

rootlesskit-docker-proxy (a docker-proxy proxy)

That smells a lot like a port-forwarding "driver" for rootless Docker. I wonder if it would be worthwhile to move the rootlesskit client stuff into the dockerd side so rootlesskit-docker-proxy can go away. Dockerd is already integrating with rootlesskit in other respects, and getting rootlesskit-docker-proxy out of the way would make it that much more practical to tackle e.g. implementing proxy-per-container multiplexing.

Thanks Cory, I've had a go at that - there are three new commits ...

Make the RootlessKit API calls from the daemon before starting docker-proxy (instead of rootlesskit-docker-proxy).
Make those calls in rootless mode, even when --userland-proxy=false, which didn't work in rootless mode before.
Remove rootlesskit-docker-proxy from the build.

@AkihiroSuda - it'd be good to get your thoughts on this, there may well be something we missed.

In preparation for the daemon passing a listen fd, add command line option -use-listen-fd to indicate that the fd is present (as fd 4). If the new flag isn't given, open the listener as normal. Refactor the TCP and UDP proxies to be constructed with an existing TCPListener or UDPConn, respectively. Lift the responsibilty of opening the listener to the entrypoint. Per the Single Responsibility Principle, this structure affords changing how the listener is created without having to touch the proxy implementations. Co-authored-by: Cory Snider <[email protected]> Signed-off-by: Rob Murray <[email protected]>

Signed-off-by: Cory Snider <[email protected]>

It was only used in tests. Signed-off-by: Cory Snider <[email protected]>

Get rid of "FIXME: Got an API for which error does not match any expected type!!! error="driver failed programming external connectivity on endpoint..." from debug logs. Signed-off-by: Rob Murray <[email protected]>

Before this change, when running rootless, instead of running docker-proxy the daemon would run rootlesskit-docker-proxy. The job of rootlesskit-docker-proxy was to tell RootlessKit about mapped host ports before starting docker-proxy, and then to remove the mapping when it was stopped. So, rootlesskit-docker-proxy would need to be kept in-step with changes to docker-proxy (particuarly the upcoming change to bind TCP/UDP ports in the daemon and pass them to the proxy, but also possible-future changes like running proxy per-container rather than per-port-mapping). This change runs the docker-proxy in rootless mode, instead of rootlesskit-docker-proxy, and the daemon itself tells RootlessKit about changes in host port mappings. Signed-off-by: Rob Murray <[email protected]>

It's not needed, now the daemon tells RootlessKit about port mappings directly. Signed-off-by: Rob Murray <[email protected]>

Before commit 4f09af6, when allocating host ports for a new port mapping, iptables rules were set up then docker-proxy was started. If the host port was already in-use, docker-proxy exited with an error, and the iptables rules were removed. That could potentially interfere with a non-docker service that was already using the host port for something unrelated. Commit 4f09af6 swapped that problem for a different one... in order to check that a port was available before creating iptables rules, it attempted to start docker-proxy first. If it failed, it could then try a different host port, without interfering with any other service. The problem with that is docker-proxy would start listening before the iptables rules were in place, so it could accept connections then become unusable because new NAT rules diverted packets directly to the container. This would leave the client with a broken connection, causing at-least a delay while it figured that out and reconnected. This change creates and binds the socket in the daemon, before creating iptables rules. If the bind fails, it may try a different port. When or if the bind succeeds, iptables rules are created, then the daemon calls listen on the socket. If docker-proxy is needed, the socket is handed over to it at that point. In rootless mode, the ports have to be bound to an address in the rootless network namespace (where dockerd is running). DNAT rules now use the same address. If docker-proxy is not needed ("--userland-proxy=false"), the daemon still listens on TCP sockets as the old dummyProxy would have done. This makes the socket show up in "netstat" output. The dummyProxy is no longer needed on Linux. Its job was to bind the host ports if docker-proxy was disabled, but that's now already handled by binding the sockets early. This change doesn't affect SCTP, because it's not currently possible for docker-proxy to convert the file descriptor into an SCTPListener. So, docker-proxy is still started early, and the window for lost connections remains. If the user has an old docker-proxy in their path and it's given a listener docker with '-use-listen-fd', it'll fail because of the unknown option. In this case, the daemon's error message suggests checking $PATH. Signed-off-by: Rob Murray <[email protected]>

The daemon was modified to tell RootlessKit about host port mappings directly, rather than by running rootlesskit-docker-proxy to make those updates. DNAT rules created in rootless mode referred to the host IP address, rather than the address seen as host address in the rootless network namespace. With these changes, port mappings work in rootless mode when --userland-proxy=false - so, don't gate the RootlessKit API calls on starting docker-proxy. Signed-off-by: Rob Murray <[email protected]>

thaJeztah · 2024-08-06T15:31:59Z

Two quick questions;

The location of the userland-proxy binary is configurable; do we know if there's existing alternative implementations that try to match ABI compatibility? I guess we may fail if we'd encounter one of those, as they don't know about the new use-listen-fd option? Have we tried what this situation would look like (e.g. by using a v27.1.1 version of the proxy with this PR)?
☝️ I do expect this to be a really niche case, but mostly wondering how loud it fails, and if the error is clear enough for users in such a situation to trace back the cause.
Slightly similar; how do we handle proxies when live-restore is enabled? Do we keep those running? And in that case, what happens when updating docker (thus: updating to a newer version of docker-proxy) ?
☝️ Note that we do NOT support live-restore when doing "major version updates" (only patch releases), so depending on what version update this lands in may be just "expected behavior", but just thought I'd ask. (And to get an understanding what amount of release-notes are needed to warn users about these).

thaJeztah · 2024-08-06T15:33:45Z

Also FWIW; regardless what version this PR lands in, the first commit would likely be fine on its own; it's an unused binary, and with docker/docker-ce-packaging#1045 in place, means that our packaging scripts also no longer expect the unused binary to be present.

robmry · 2024-08-06T15:45:10Z

The location of the userland-proxy binary is configurable; do we know if there's existing alternative implementations that try to match ABI compatibility? I guess we may fail if we'd encounter one of those, as they don't know about the new use-listen-fd option? Have we tried what this situation would look like (e.g. by using a v27.1.1 version of the proxy with this PR)?

An old version of our docker-proxy fails with a long-winded error message that ends with failed to start docker-proxy, check that the current version is in your $PATH. (There's a full version in the PR description.)

But no, I don't know if anyone's implemented their own version. If they have, and they ignore the new --use-listen-fd option, it'll probably fail because the ports are already bound... and the error response from their version of the proxy command is under their control.

Slightly similar; how do we handle proxies when live-restore is enabled? Do we keep those running? And in that case, what happens when updating docker (thus: updating to a newer version of docker-proxy) ?

We don't keep them running ... which seems pretty wrong. But, at least it makes this simpler.

thaJeztah

LGTM

thanks for answering my questions!

robmry self-assigned this Jul 4, 2024

robmry added area/networking Networking kind/bugfix PR's that fix bugs area/networking/proxy Networking area/networking/portmapping Networking version/27.0 labels Jul 4, 2024

robmry added this to the 28.0.0 milestone Jul 4, 2024

robmry force-pushed the bind_socket_for_docker_proxy branch 2 times, most recently from 76afba0 to 50ed3a8 Compare July 4, 2024 13:56

robmry marked this pull request as ready for review July 4, 2024 16:27

robmry requested review from akerouanton and corhere July 4, 2024 16:27

corhere reviewed Jul 6, 2024

View reviewed changes

Comment thread cmd/docker-proxy/main.go Outdated

Comment thread cmd/docker-proxy/proxy.go Outdated

Comment thread cmd/docker-proxy/tcp_proxy.go Outdated

Comment thread cmd/docker-proxy/tcp_proxy.go Outdated

Comment thread cmd/docker-proxy/udp_proxy.go Outdated

robmry force-pushed the bind_socket_for_docker_proxy branch from 50ed3a8 to e6728e6 Compare July 8, 2024 18:11

corhere reviewed Jul 8, 2024

View reviewed changes

Comment thread cmd/docker-proxy/main.go Outdated

Comment thread libnetwork/drivers/bridge/port_mapping_linux.go Outdated

Comment thread libnetwork/portmapper/proxy.go

Comment thread libnetwork/portmapper/proxy_linux.go Outdated

robmry force-pushed the bind_socket_for_docker_proxy branch from e6728e6 to b021e87 Compare July 9, 2024 10:59

corhere reviewed Jul 9, 2024

View reviewed changes

Comment thread libnetwork/drivers/bridge/port_mapping_linux.go Outdated

robmry force-pushed the bind_socket_for_docker_proxy branch from b021e87 to 66a6bf3 Compare July 10, 2024 16:45

robmry requested a review from tianon as a code owner July 10, 2024 16:45

robmry requested a review from AkihiroSuda July 10, 2024 17:25

AkihiroSuda removed the version/27.0 label Jul 11, 2024

AkihiroSuda reviewed Jul 11, 2024

View reviewed changes

Comment thread libnetwork/drivers/bridge/rootlesskit_client_linux.go Outdated

AkihiroSuda reviewed Jul 11, 2024

View reviewed changes

Comment thread libnetwork/drivers/bridge/rootlesskit_client_linux.go Outdated

AkihiroSuda reviewed Jul 11, 2024

View reviewed changes

Comment thread libnetwork/portmapper/proxy_linux.go

robmry force-pushed the bind_socket_for_docker_proxy branch from 66a6bf3 to b0caf36 Compare July 11, 2024 17:37

AkihiroSuda approved these changes Jul 12, 2024

View reviewed changes

robmry and others added 8 commits August 5, 2024 14:04

cmd/docker-proxy: drop unused BackendAddr() method

88fb0c9

Signed-off-by: Cory Snider <[email protected]>

cmd/docker-proxy: drop FrontendAddr() method

7a8663c

It was only used in tests. Signed-off-by: Cory Snider <[email protected]>

Fix error type used for ProgramExternalConnectivity

384ca56

Get rid of "FIXME: Got an API for which error does not match any expected type!!! error="driver failed programming external connectivity on endpoint..." from debug logs. Signed-off-by: Rob Murray <[email protected]>

Remove rootlesskit-docker-proxy

dac7ffa

It's not needed, now the daemon tells RootlessKit about port mappings directly. Signed-off-by: Rob Murray <[email protected]>

robmry force-pushed the bind_socket_for_docker_proxy branch from fe9c7b1 to cdea750 Compare August 5, 2024 13:04

thaJeztah mentioned this pull request Aug 6, 2024

[master] static: remove unused docker-proxy.exe binary from Windows packages docker/docker-ce-packaging#1045

Merged

vvoland added the process/cherry-pick/27.2 label Aug 8, 2024

thaJeztah approved these changes Aug 8, 2024

View reviewed changes

thaJeztah merged commit 8d06e70 into moby:master Aug 8, 2024

This was referenced Aug 10, 2024

remove leftovers for building docker-proxy on Windows #48318

Merged

[27.x backport] Create docker-proxy TCP/UDP listener sockets in the daemon #48338

Closed

vvoland removed the process/cherry-pick/27.2 label Aug 16, 2024

robmry deleted the bind_socket_for_docker_proxy branch September 16, 2024 08:42

akerouanton mentioned this pull request Oct 2, 2024

cmd/docker-proxy: re-add SO_REUSEADDR #48567

Merged

robmry added the impact/changelog label Nov 4, 2024

antonireus mentioned this pull request Jan 15, 2025

Fix Exasol flaky tests. Avoid exposing not required ports trinodb/trino#24710

Merged

AkihiroSuda mentioned this pull request Feb 20, 2025

v28 regression: protocol "tcp" is not supported by the RootlessKit port driver "slirp4netns", discarding request for "[::]:8080" #49499

Closed

thaJeztah mentioned this pull request Feb 24, 2025

libnet/d/bridge: fix compilation on i386 #49526

Merged

robmry mentioned this pull request May 22, 2025

Avoid selecting duplicate host ports for mappings to 0.0.0.0 and specific addresses #50054

Merged

crazy-max mentioned this pull request Jun 8, 2025

pkg(docker-engine): multiple fixes docker/packaging#205

Merged

robmry mentioned this pull request Aug 11, 2025

libnet/pa: OSAllocator: retry allocations #50691

Merged

This was referenced Sep 24, 2025

guestagent: iptables watcher should not be needed at all lima-vm/lima#4085

Closed

rootful: reserve the ports on the host containerd/nerdctl#4526

Merged

Conversation

robmry commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

corhere left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robmry commented Jul 8, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robmry commented Jul 9, 2024

Uh oh!

Uh oh!

corhere commented Jul 9, 2024

Uh oh!

robmry commented Jul 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thaJeztah commented Aug 6, 2024

Uh oh!

thaJeztah commented Aug 6, 2024

Uh oh!

robmry commented Aug 6, 2024

Uh oh!

thaJeztah left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

robmry commented Jul 4, 2024 •

edited

Loading

robmry commented Jul 10, 2024 •

edited

Loading