Skip to content

libnet/d/bridge: port mappings: filter by input iface#48721

Merged
akerouanton merged 1 commit intomoby:masterfrom
akerouanton:45610-filter-by-input-iface
Jan 14, 2025
Merged

libnet/d/bridge: port mappings: filter by input iface#48721
akerouanton merged 1 commit intomoby:masterfrom
akerouanton:45610-filter-by-input-iface

Conversation

@akerouanton
Copy link
Copy Markdown
Member

@akerouanton akerouanton commented Oct 22, 2024

- What I did

When a NAT-based port mapping is created with a HostIP specified, we insert a DNAT rule in nat DOCKER to replace the dest addr with the container IP. Then, in filter chains, we allow access to the container port for any packet not coming from the container's network itself (if hairpinning is disabled), nor from another host bridge.

However we don't set any rule that prevent a rogue neighbor located in another L2 segment / subnet from sending packets destined to that HostIP.

For instance, if a port binding is created with HostIP == '127.0.0.1', this port should not be accessible from anything but the lo interface. That's currently not the case and this provides a false sense of security.

Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP rules, this change adds rules into raw-PREROUTING to filter ingress packets destined to mapped ports based on the input interface, the dest addr and the dest port.

Interfaces are dynamically resolved when packets hit the host, thanks to iptables' addrtype extension. This extension does a fib lookup of the dest addr and checks that it's associated with the interface reached.

Also, when a proxy-based port mapping is created, as is the case when an IPv6 HostIP is specified but the container is only IPv4-capable, we don't set any sort of filtering. So the same issue might happen. The reason is a bit different - in that case, that's just how the kernel works. But, in order to stay consistent with NAT-based mappings, these rules are also applied.

The env var DOCKER_DISABLE_INPUT_IFACE_FILTERING can be set to any true-ish value to globally disable this behavior.

- How to verify it

A new regression / integration test TestAccessPublishedPortFromNonMatchingIface checks that remote access is correctly disallowed for ports published on loopback and routable addresses.

- Description for the changelog

- Add a couple of iptables rules to filter on the input interface for NAT port mappings. This will prevent rogue neighboring hosts from accessing port mappings that aren't published in the same subnet / L2 segment.
  - The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any `true`-ish value to globally disable this filtering. This is a temporary escape hatch and will be removed in a future release. Please report any issue if you need to use it.

Comment thread internal/testutils/networking/host_addr_linux.go Outdated
Copy link
Copy Markdown
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed - maybe worth clarifying in the commit message that the attacker needs to share an L2 segment / subnet with one of the docker host's interfaces (and it's the attacked interface that's on a different segment).

Comment thread internal/testutils/networking/host_addr_linux.go Outdated
stdOut := strings.TrimSpace(actualStdout.String())
assert.Assert(t, !strings.Contains(stdOut, "foobar"))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd probably be worth adding a test that's expected to succeed? (By sending to the "eth-test" addresses I think?)

That'd help to confirm the packet's not getting through in the other tests because of the new rule - rather than the container log read being too quick, or some other test problem?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also run the tests twice, with the escape hatch env-var set and unset?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests now run with and without the escape hatch, and all test cases also run the 'happy case' to make sure the test is working as expected.

Comment thread libnetwork/drivers/bridge/port_mapping_linux.go
Comment thread libnetwork/drivers/bridge/port_mapping_linux.go Outdated
Comment thread integration/internal/network/ops.go Outdated

// WithIPv6 Enables IPv6 on the network
func WithIPv6() func(*network.CreateOptions) {
func WithIPv6(enable bool) func(*network.CreateOptions) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly on the fence on adding the argument (but could use input from others); perhaps a WithDisableIPv6() could make it easier to find places where we disable, and we could keep the WithIPv6 as-is.

Also curious what is the default if not set? (we don't have an option to reset, correct?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added WithDisableIPv6() and removed the commit that was changing WithIPv6() signature.

@akerouanton akerouanton force-pushed the 45610-filter-by-input-iface branch from 402268c to 50c391b Compare November 8, 2024 10:17
Copy link
Copy Markdown
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - except the new iptables doc needs to be manually added to integration/network/bridge/iptablesdoc/index.md.

Comment on lines +890 to +900
if v, _ := strconv.ParseBool(os.Getenv("DOCKER_DISABLE_INPUT_IFACE_FILTERING")); v {
return nil
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions about this env-var;

  • Do we consider this one to be permanent or temporary ?
  • Should we have a log entry to indicate this option was set?
  • If we consider it temporary, have a warning added to docker info ? (check if it's set, and if so, add a warning to the list of warnings returned)?

Copy link
Copy Markdown
Member Author

@akerouanton akerouanton Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we consider this one to be permanent or temporary ?

That's a temporary escape hatch in case the new filtering does break someone, somewhere.

Should we have a log entry to indicate this option was set?

I added an info log to that code path. Now, whenever users create a container with NAT-based port-mappings, they'll get this logged:

DOCKER_DISABLE_INPUT_IFACE_FILTERING is set, skipping input iface filtering

If we consider it temporary, have a warning added to docker info ? (check if it's set, and if so, add a warning to the list of warnings returned)?

I added these two warnings to make it clear that it's insecure, and that it'll be removed at a later time.

WARNING: input interface filtering is disabled on port mappings, this might be insecure
DEPRECATED: DOCKER_DISABLE_INPUT_IFACE_FILTERING is deprecated and will be removed in a future release

@akerouanton akerouanton force-pushed the 45610-filter-by-input-iface branch from 50c391b to 2526269 Compare November 14, 2024 10:19
// DOCKER_DISABLE_INPUT_IFACE_FILTERING can be used as an escape hatch if
// this filtering doesn't work out well for some users.
if v, _ := strconv.ParseBool(os.Getenv("DOCKER_DISABLE_INPUT_IFACE_FILTERING")); v {
log.G(context.TODO()).Info("DOCKER_DISABLE_INPUT_IFACE_FILTERING is set, skipping input iface filtering.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'd even be ok with Warn for this if we consider it "bad", but Info works for me.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's now a Warn.

Comment thread Dockerfile
libprotobuf-c1 \
libyajl2 \
net-tools \
netcat-openbsd \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only for our tests, or is this a new dependency we should have installed (and to be added as "requires" in our packages?)

I can't find a direct reference to this in this PR; was it accidentally left behind (for local debugging?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// not reach the container (ie. the container returns an ICMP
// 'Port Unreachable' message).
time.Sleep(50 * time.Millisecond)
icmd.RunCommand("/bin/sh", "-c", fmt.Sprintf("echo foobar | nc -w1 -u %s %s", hostAddr, hostPort)).Assert(t, icmd.Success)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK 🙈 I went looking for netcat not nc; @akerouanton pointed me it's used here, and only needed for testing, so no packaging changes required.

Comment on lines +763 to +798
// TODO(aker): we need to figure out what we want to do for rootlesskit.
skip.If(t, testEnv.IsRootless, "rootlesskit has its own netns")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AkihiroSuda I'm wondering what we should do for rootless mode.

On one hand, accepting packets with a given dest addr from any interface is just how Linux sockets work. So this issue also exists with rootless mode I think.

OTOH, this proved to be surprising for many users. The biggest issue is for loopback addresses, ie. in non-rootless mode we'd happily forward packets with a dest addr in 127.0.0.0/8 and coming from any interface. Rootless mode shouldn't be susceptible to this security issue (because the iptables rules that cause this issue don't live in the host netns).

I'm wondering if rootless / non-rootless modes should diverge from each other (one protecting, and the other not protecting against that issue). Or if we should diverge from Linux sockets' semantics in both cases. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what is the attack scenario?

Rootless mode has its own 127.0.0.0/8 (until #47103 is merged) that is isolated from the host-side 127.0.0.0/8

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two different attack scenarios.

First one involves ports bound to one of the loopback address. In that case, the attacker could send packets with the loopback address set as the daddr, and the target Docker host would accept them. That's not a problem for rootless mode because the port mapping is proxied by rootlesskit, and thus packets are correctly filtered by the kernel by default.

The second attack scenario involves ports mapped to a specific subnet, and an attacker part of another subnet. In that case, the attacker could send packets destined to the IP address where ports are mapped, without being part of that subnet, and the victim Docker host would accept them too. Whether it's a security issue is debatable because that's just how networking works on Linux. However, it can be fixed relatively easily in rootless mode through an iptables rule, but that's harder for a rootful Engine as you need to insert rule(s) in chains managed by Docker -- so we considered this as a vulnerability too.

But now with the current fix, that means rootful docker will be protected against both attack scenarios, and rootless mode will be affected by one.

So either rootful mode diverges from default kernel behavior, introducing a discrepancy between rootless and rootful mode, but keeping users at bay.

Or we could consider the 2nd scenario as not being a security issue, but then users are on their own to prevent such access patterns - and that means manually setting up iptables rules, ensuring these rules are always correctly positioned in the chains, etc...

@akerouanton akerouanton force-pushed the 45610-filter-by-input-iface branch from 2526269 to 0b39171 Compare November 14, 2024 10:55
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// For instance, if HostIP == 127.0.0.1, no ingress should come from anything
// but lo.
func (n *bridgeNetwork) filterPortByInputIface(b portBinding, enable bool) error {
hostIP := b.childHostIP
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (no need to change unless other changes are needed); this could've been put closer to where it's used (also because it may not be used at all due to early returns)

Comment on lines +825 to +827
d := newDriver()
d.nlh, err = nlwrap.NewHandle()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qNot for this PR, but I noticed the weird newDriver() / driver.configure() design, which is a bit .. odd, as it looks like driver.configure() is the actual constructor, and newDriver() is mostly a useless thing; (almost?) all locations where it's used, look like;

d := newDriver()
if err := d.configure(options); err != nil {
 ....
}

We should consider making newDriver accept the options (if my interpretation of the flow is correct); looks like this was the commit introducing some of it; d565a4d, although it was there in some form before that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was addressed in #49158, but I now realize that d.nlh should probably be initialized in newDriver too. I'll open another PR to fix that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#49267 was merged; should this line be removed now?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh; this is NewHandle() not NlHandle(), so not same I guess

When a NAT-based port mapping is created with a HostIP specified, we
insert a DNAT rule in nat-DOCKER to replace the dest addr with the
container IP. Then, in filter chains, we allow access to the container
port for any packet not coming from the container's network itself (if
hairpinning is disabled), nor from another host bridge.

However we don't set any rule that prevents a rogue neighbor that shares
a L2 segment with the host, but not the one where the port binding is
expected to be published, from sending packets destined to that HostIP.

For instance, if a port binding is created with HostIP == '127.0.0.1',
this port should not be accessible from anything but the lo interface.
That's currently not the case and this provides a false sense of
security.

Since nat-DOCKER mangles the dest addr, and the nat table rejects DROP
rules, this change adds rules into raw-PREROUTING to filter ingress
packets destined to mapped ports based on the input interface, the dest
addr and the dest port.

Interfaces are dynamically resolved when packets hit the host, thanks
to iptables' addrtype extension. This extension does a fib lookup of the
dest addr and checks that it's associated with the interface reached.

Also, when a proxy-based port mapping is created, as is the case when an
IPv6 HostIP is specified but the container is only IPv4-capable, we
don't set any sort of filtering. So the same issue might happen. The
reason is a bit different - in that case, that's just how the kernel
works. But, in order to stay consistent with NAT-based mappings, these
rules are also applied.

The env var `DOCKER_DISABLE_INPUT_IFACE_FILTERING` can be set to any
true-ish value to globally disable this behavior.

Signed-off-by: Albin Kerouanton <[email protected]>
@akerouanton akerouanton force-pushed the 45610-filter-by-input-iface branch from 08b696f to 433b1f9 Compare January 13, 2025 18:04
@akerouanton
Copy link
Copy Markdown
Member Author

Last force-pushes were only rebase, so let me merge it.

@akerouanton akerouanton merged commit 6abba37 into moby:master Jan 14, 2025
@akerouanton akerouanton deleted the 45610-filter-by-input-iface branch January 14, 2025 12:04
@yrro
Copy link
Copy Markdown

yrro commented Jan 14, 2025

This is excellent. Thanks @akerouanton and everyone else who worked on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Publishing ports explicitly to private networks should not be accessible from LAN hosts

5 participants