libnet/d/bridge: port mappings: filter by input iface by akerouanton · Pull Request #48721 · moby/moby

akerouanton · 2024-10-22T09:08:55Z

~~Fixes Publishing ports explicitly to private networks should not be accessible from LAN hosts #45610~~
Reverted by Revert "libnet/d/bridge: port mappings: filter by input iface" #49310

- What I did

When a NAT-based port mapping is created with a HostIP specified, we insert a DNAT rule in nat DOCKER to replace the dest addr with the container IP. Then, in filter chains, we allow access to the container port for any packet not coming from the container's network itself (if hairpinning is disabled), nor from another host bridge.

However we don't set any rule that prevent a rogue neighbor located in another L2 segment / subnet from sending packets destined to that HostIP.

For instance, if a port binding is created with HostIP == '127.0.0.1', this port should not be accessible from anything but the lo interface. That's currently not the case and this provides a false sense of security.

Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP rules, this change adds rules into raw-PREROUTING to filter ingress packets destined to mapped ports based on the input interface, the dest addr and the dest port.

Interfaces are dynamically resolved when packets hit the host, thanks to iptables' addrtype extension. This extension does a fib lookup of the dest addr and checks that it's associated with the interface reached.

Also, when a proxy-based port mapping is created, as is the case when an IPv6 HostIP is specified but the container is only IPv4-capable, we don't set any sort of filtering. So the same issue might happen. The reason is a bit different - in that case, that's just how the kernel works. But, in order to stay consistent with NAT-based mappings, these rules are also applied.

The env var DOCKER_DISABLE_INPUT_IFACE_FILTERING can be set to any true-ish value to globally disable this behavior.

- How to verify it

A new regression / integration test TestAccessPublishedPortFromNonMatchingIface checks that remote access is correctly disallowed for ports published on loopback and routable addresses.

- Description for the changelog

- Add a couple of iptables rules to filter on the input interface for NAT port mappings. This will prevent rogue neighboring hosts from accessing port mappings that aren't published in the same subnet / L2 segment.
  - The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any `true`-ish value to globally disable this filtering. This is a temporary escape hatch and will be removed in a future release. Please report any issue if you need to use it.

robmry

As discussed - maybe worth clarifying in the commit message that the attacker needs to share an L2 segment / subnet with one of the docker host's interfaces (and it's the attacked interface that's on a different segment).

robmry · 2024-10-25T15:14:22Z

+		stdOut := strings.TrimSpace(actualStdout.String())
+		assert.Assert(t, !strings.Contains(stdOut, "foobar"))
+	}
+


It'd probably be worth adding a test that's expected to succeed? (By sending to the "eth-test" addresses I think?)

That'd help to confirm the packet's not getting through in the other tests because of the new rule - rather than the container log read being too quick, or some other test problem?

Maybe also run the tests twice, with the escape hatch env-var set and unset?

The tests now run with and without the escape hatch, and all test cases also run the 'happy case' to make sure the test is working as expected.

thaJeztah · 2024-10-31T12:33:38Z


 // WithIPv6 Enables IPv6 on the network
-func WithIPv6() func(*network.CreateOptions) {
+func WithIPv6(enable bool) func(*network.CreateOptions) {


I'm slightly on the fence on adding the argument (but could use input from others); perhaps a WithDisableIPv6() could make it easier to find places where we disable, and we could keep the WithIPv6 as-is.

Also curious what is the default if not set? (we don't have an option to reset, correct?)

I added WithDisableIPv6() and removed the commit that was changing WithIPv6() signature.

robmry

LGTM - except the new iptables doc needs to be manually added to integration/network/bridge/iptablesdoc/index.md.

thaJeztah · 2024-11-11T11:34:41Z

+	if v, _ := strconv.ParseBool(os.Getenv("DOCKER_DISABLE_INPUT_IFACE_FILTERING")); v {
+		return nil
+	}


Some questions about this env-var;

Do we consider this one to be permanent or temporary ?

Should we have a log entry to indicate this option was set?

If we consider it temporary, have a warning added to docker info ? (check if it's set, and if so, add a warning to the list of warnings returned)?

Do we consider this one to be permanent or temporary ?

That's a temporary escape hatch in case the new filtering does break someone, somewhere.

Should we have a log entry to indicate this option was set?

I added an info log to that code path. Now, whenever users create a container with NAT-based port-mappings, they'll get this logged:

DOCKER_DISABLE_INPUT_IFACE_FILTERING is set, skipping input iface filtering

If we consider it temporary, have a warning added to docker info ? (check if it's set, and if so, add a warning to the list of warnings returned)?

I added these two warnings to make it clear that it's insecure, and that it'll be removed at a later time.

WARNING: input interface filtering is disabled on port mappings, this might be insecure DEPRECATED: DOCKER_DISABLE_INPUT_IFACE_FILTERING is deprecated and will be removed in a future release

thaJeztah · 2024-11-14T10:29:17Z

+	// DOCKER_DISABLE_INPUT_IFACE_FILTERING can be used as an escape hatch if
+	// this filtering doesn't work out well for some users.
+	if v, _ := strconv.ParseBool(os.Getenv("DOCKER_DISABLE_INPUT_IFACE_FILTERING")); v {
+		log.G(context.TODO()).Info("DOCKER_DISABLE_INPUT_IFACE_FILTERING is set, skipping input iface filtering.")


Thanks! I'd even be ok with Warn for this if we consider it "bad", but Info works for me.

That's now a Warn.

thaJeztah · 2024-11-14T10:31:38Z

            libprotobuf-c1 \
            libyajl2 \
            net-tools \
+            netcat-openbsd \


Is this only for our tests, or is this a new dependency we should have installed (and to be added as "requires" in our packages?)

I can't find a direct reference to this in this PR; was it accidentally left behind (for local debugging?)

That's only for testing. Here's where it's used: https://github.com/moby/moby/pull/48721/files#diff-4fd76a7e4270f16a03d5160517ee2454558f127af10e95c55451963b17429414R817

thaJeztah · 2024-11-14T10:38:15Z

+				// not reach the container (ie. the container returns an ICMP
+				// 'Port Unreachable' message).
+				time.Sleep(50 * time.Millisecond)
+				icmd.RunCommand("/bin/sh", "-c", fmt.Sprintf("echo foobar | nc -w1 -u %s %s", hostAddr, hostPort)).Assert(t, icmd.Success)


OK 🙈 I went looking for netcat not nc; @akerouanton pointed me it's used here, and only needed for testing, so no packaging changes required.

akerouanton · 2024-11-14T10:54:55Z

+	// TODO(aker): we need to figure out what we want to do for rootlesskit.
+	skip.If(t, testEnv.IsRootless, "rootlesskit has its own netns")


@AkihiroSuda I'm wondering what we should do for rootless mode.

On one hand, accepting packets with a given dest addr from any interface is just how Linux sockets work. So this issue also exists with rootless mode I think.

OTOH, this proved to be surprising for many users. The biggest issue is for loopback addresses, ie. in non-rootless mode we'd happily forward packets with a dest addr in 127.0.0.0/8 and coming from any interface. Rootless mode shouldn't be susceptible to this security issue (because the iptables rules that cause this issue don't live in the host netns).

I'm wondering if rootless / non-rootless modes should diverge from each other (one protecting, and the other not protecting against that issue). Or if we should diverge from Linux sockets' semantics in both cases. WDYT?

Could you explain what is the attack scenario?

Rootless mode has its own 127.0.0.0/8 (until #47103 is merged) that is isolated from the host-side 127.0.0.0/8

There are two different attack scenarios.

First one involves ports bound to one of the loopback address. In that case, the attacker could send packets with the loopback address set as the daddr, and the target Docker host would accept them. That's not a problem for rootless mode because the port mapping is proxied by rootlesskit, and thus packets are correctly filtered by the kernel by default.

The second attack scenario involves ports mapped to a specific subnet, and an attacker part of another subnet. In that case, the attacker could send packets destined to the IP address where ports are mapped, without being part of that subnet, and the victim Docker host would accept them too. Whether it's a security issue is debatable because that's just how networking works on Linux. However, it can be fixed relatively easily in rootless mode through an iptables rule, but that's harder for a rootful Engine as you need to insert rule(s) in chains managed by Docker -- so we considered this as a vulnerability too.

But now with the current fix, that means rootful docker will be protected against both attack scenarios, and rootless mode will be affected by one.

So either rootful mode diverges from default kernel behavior, introducing a discrepancy between rootless and rootful mode, but keeping users at bay.

Or we could consider the 2nd scenario as not being a security issue, but then users are on their own to prevent such access patterns - and that means manually setting up iptables rules, ensuring these rules are always correctly positioned in the chains, etc...

thaJeztah

LGTM

thaJeztah · 2024-11-14T10:45:55Z

+// For instance, if HostIP == 127.0.0.1, no ingress should come from anything
+// but lo.
+func (n *bridgeNetwork) filterPortByInputIface(b portBinding, enable bool) error {
+	hostIP := b.childHostIP


nit (no need to change unless other changes are needed); this could've been put closer to where it's used (also because it may not be used at all due to early returns)

thaJeztah · 2024-11-14T11:00:00Z

+			d := newDriver()
+			d.nlh, err = nlwrap.NewHandle()


qNot for this PR, but I noticed the weird newDriver() / driver.configure() design, which is a bit .. odd, as it looks like driver.configure() is the actual constructor, and newDriver() is mostly a useless thing; (almost?) all locations where it's used, look like;

d := newDriver() if err := d.configure(options); err != nil { .... }

We should consider making newDriver accept the options (if my interpretation of the flow is correct); looks like this was the commit introducing some of it; d565a4d, although it was there in some form before that.

This was addressed in #49158, but I now realize that d.nlh should probably be initialized in newDriver too. I'll open another PR to fix that.

#49267 was merged; should this line be removed now?

oh; this is NewHandle() not NlHandle(), so not same I guess

When a NAT-based port mapping is created with a HostIP specified, we insert a DNAT rule in nat-DOCKER to replace the dest addr with the container IP. Then, in filter chains, we allow access to the container port for any packet not coming from the container's network itself (if hairpinning is disabled), nor from another host bridge. However we don't set any rule that prevents a rogue neighbor that shares a L2 segment with the host, but not the one where the port binding is expected to be published, from sending packets destined to that HostIP. For instance, if a port binding is created with HostIP == '127.0.0.1', this port should not be accessible from anything but the lo interface. That's currently not the case and this provides a false sense of security. Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP rules, this change adds rules into raw-PREROUTING to filter ingress packets destined to mapped ports based on the input interface, the dest addr and the dest port. Interfaces are dynamically resolved when packets hit the host, thanks to iptables' addrtype extension. This extension does a fib lookup of the dest addr and checks that it's associated with the interface reached. Also, when a proxy-based port mapping is created, as is the case when an IPv6 HostIP is specified but the container is only IPv4-capable, we don't set any sort of filtering. So the same issue might happen. The reason is a bit different - in that case, that's just how the kernel works. But, in order to stay consistent with NAT-based mappings, these rules are also applied. The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any true-ish value to globally disable this behavior. Signed-off-by: Albin Kerouanton <[email protected]>

akerouanton · 2025-01-14T12:04:18Z

Last force-pushes were only rebase, so let me merge it.

yrro · 2025-01-14T13:27:16Z

This is excellent. Thanks @akerouanton and everyone else who worked on this.

akerouanton added status/2-code-review area/security area/networking Networking impact/changelog impact/documentation area/networking/firewalling Networking area/networking/d/bridge Networking area/networking/portmapping Networking labels Oct 22, 2024

akerouanton added this to the 28.0.0 milestone Oct 22, 2024

akerouanton self-assigned this Oct 22, 2024

akerouanton force-pushed the 45610-filter-by-input-iface branch 11 times, most recently from 52fa34f to 1a4e9ef Compare October 24, 2024 09:09

akerouanton mentioned this pull request Oct 24, 2024

inte/t/networking: delete veth ifaces before netns #48749

Merged

akerouanton force-pushed the 45610-filter-by-input-iface branch from 1a4e9ef to 519ef77 Compare October 25, 2024 08:00

akerouanton marked this pull request as ready for review October 25, 2024 14:33

akerouanton requested a review from robmry October 25, 2024 14:33

thaJeztah reviewed Oct 25, 2024

View reviewed changes

Comment thread internal/testutils/networking/host_addr_linux.go Outdated

robmry reviewed Oct 25, 2024

View reviewed changes

thaJeztah reviewed Oct 31, 2024

View reviewed changes

robmry mentioned this pull request Nov 4, 2024

[epic] iptables and port publishing improvements for moby 28.0 #48815

Closed

11 tasks

akerouanton force-pushed the 45610-filter-by-input-iface branch from 519ef77 to 402268c Compare November 8, 2024 09:28

akerouanton force-pushed the 45610-filter-by-input-iface branch from 402268c to 50c391b Compare November 8, 2024 10:17

robmry approved these changes Nov 8, 2024

View reviewed changes

thaJeztah reviewed Nov 11, 2024

View reviewed changes

akerouanton force-pushed the 45610-filter-by-input-iface branch from 50c391b to 2526269 Compare November 14, 2024 10:19

akerouanton requested a review from thaJeztah November 14, 2024 10:20

thaJeztah reviewed Nov 14, 2024

View reviewed changes

akerouanton commented Nov 14, 2024

View reviewed changes

akerouanton force-pushed the 45610-filter-by-input-iface branch from 2526269 to 0b39171 Compare November 14, 2024 10:55

thaJeztah approved these changes Nov 14, 2024

View reviewed changes

akerouanton mentioned this pull request Jan 10, 2025

Publishing ports explicitly to private networks should not be accessible from LAN hosts #45610

Closed

akerouanton force-pushed the 45610-filter-by-input-iface branch from 0b39171 to e6af2fc Compare January 13, 2025 16:34

akerouanton mentioned this pull request Jan 13, 2025

libnet/d/bridge: init driver.nlh in newDriver #49267

Merged

akerouanton force-pushed the 45610-filter-by-input-iface branch from e6af2fc to 08b696f Compare January 13, 2025 16:44

akerouanton force-pushed the 45610-filter-by-input-iface branch from 08b696f to 433b1f9 Compare January 13, 2025 18:04

akerouanton merged commit 6abba37 into moby:master Jan 14, 2025

akerouanton deleted the 45610-filter-by-input-iface branch January 14, 2025 12:04

This was referenced Jan 20, 2025

Revert "libnet/d/bridge: port mappings: filter by input iface" #49310

Merged

libnet/d/bridge: drop connections to lo mappings, and direct remote connections #49325

Merged

		// TODO(aker): we need to figure out what we want to do for rootlesskit.
		skip.If(t, testEnv.IsRootless, "rootlesskit has its own netns")

Conversation

akerouanton commented Oct 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

robmry left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

robmry left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akerouanton Nov 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thaJeztah left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akerouanton commented Jan 14, 2025

Uh oh!

yrro commented Jan 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

akerouanton commented Oct 22, 2024 •

edited

Loading

akerouanton Nov 14, 2024 •

edited

Loading