Skip to content

add networkd/nspawn nftables backend#17026

Merged
poettering merged 11 commits intosystemd:masterfrom
fw-strlen:nft_16
Dec 16, 2020
Merged

add networkd/nspawn nftables backend#17026
poettering merged 11 commits intosystemd:masterfrom
fw-strlen:nft_16

Conversation

@fw-strlen
Copy link

This adds nftables/nfnetlink support to sd-netlink, then builds nftables support for firewall-util-nft backend on top of it.
See #13307 for the existing issue filed for this.

No new library dependencies are added. First patches streamline existing API by removing 'always-NULL' arguments that are not used. Then, existing firewall-util gets renamed so that it becomes possible to implement an alternative backend
to libiptc.

Next patches extend sd-netlink with a minimal nfnetlink backend.
It only supports the subset of nftables required to make masquerade/dnat work, i.e. what libiptc backend provides.

Last patch adds the nftables backend. Iindividual commit messages have more details.

Tested:

  1. test-firewall-util works

  2. VM that hosts a nspawn-container will
    add expected dnat and masquerade rules:
    $ nft monitor
    [ container is started now ]
    add table ip io.systemd.nat
    add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
    add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
    add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
    add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
    add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
    add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
    add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
    add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
    new generation 2 by process 892 (systemd-network)
    add element ip io.systemd.nat masq_saddr { 192.168.50.112/28 }
    new generation 3 by process 892 (systemd-network)
    add element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.50.121 . 22 }
    new generation 4 by process 1404 (systemd-nspawn)
    delete element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.50.121 . 22 }
    [ container was shut down, then started again ]
    new generation 5 by process 1404 (systemd-nspawn)
    add element ip io.systemd.nat masq_saddr { 192.168.202.240-255.255.255.255 }
    new generation 6 by process 892 (systemd-network)
    add element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.202.249 . 22 }
    new generation 7 by process 1492 (systemd-nspawn)
    delete element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.202.249 . 22 }
    new generation 8 by process 1492 (systemd-nspawn)
    CTRL-C

  3. Tested that I can connect from external machine to public-ip:2222 and get NAT'd to the container. Same works from within the VM to container. Using 127.0.0.1 does not work (just like with existing iptables backend).

  4. Repeating same test with kernel that lacks support for nftables dnat results in addition of iptables rules.

Caveats:

  1. ipv6 is not supported. This would be easy to add, but its not clear why one would need masquerade or dnat for ipv6 (existing iptables backend does not support it either).
  2. With nfnetlink it would be possible to detect when external entities make 'unexpected' changes to the systemd.nat table. Not implemented, as I have no idea how systemd-networkd should react to something like this. E.g. it doesn't seem like a good idea to just 'auto-undo' such changes.
  3. 'set masq_saddr' doesn't handle overlaps. Example:
    fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
    fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */

With the iptables backend the second call works; it will just add another iptables rule.

With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with the existing range of 10.0.0.0-10.0.255.255: 2nd add gets rejected by the kernel. To resolve this it would be needed to keep track of the added elements and perform range merging when overlaps are detected. If this is needed I can add this.

  1. The changeset adds two new API functions to sd-netlink to send multiple messages as one batch and to read netlink acks.
    If there is a more appropriate existing API I should use please let me know.

@lgtm-com
Copy link

lgtm-com bot commented Sep 11, 2020

This pull request introduces 1 alert when merging f570f4a into e67b818 - view on LGTM.com

new alerts:

  • 1 for Use of potentially dangerous function

@fw-strlen
Copy link
Author

I plan to fix the reported errors on Monday. I'll also do a build test on a Centos7 machine to make sure it builds there as well.
I've noticed there are leftover LIBIPTC defines in some spots. Also the fw test program no longer needs to depend the libiptc build dependency.

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would love a test case for this (even if it's a manual one, given that the CIs might not have nft), but i guess we had none before...

@poettering
Copy link
Member

regarding the overlap issue: if people set things up that way things are broken a bit, no? maybe leave it that way hence, but propagate a recognizable error up (EUCLEAN?) and make networkd/nspawn print out a friendly explanation about it

@poettering
Copy link
Member

would love ipv6 support for this. for the nspawn case I am pretty sure it makes a ton of sense to hide the fact that an nspawn container is used in the back...

@poettering
Copy link
Member

regarding multiple messages in netlink support, i figure @yuwata knows this better by now than me...

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Sep 14, 2020
@poettering poettering linked an issue Sep 14, 2020 that may be closed by this pull request
@poettering
Copy link
Member

BTW, the ipv6 support is being asked for in #10254, but be excellent to close this at the same time

@poettering
Copy link
Member

Hmm, does this address #6106 too? if not, maybe it should?

@fw-strlen
Copy link
Author

Hmm, does this address #6106 too? if not, maybe it should?

I think that should be done in an extra request. I'm also not sure this should be changed, the behaviour is there for some time.
e.g. what if someone has a service listening on 127.0.0.1:12345 and there is a port forward for that from external:12345 to internal:42 ? If that is resolved then after update connect to 127.0.0.1:12345 ends up in the container.

@fw-strlen
Copy link
Author

BTW, the ipv6 support is being asked for in #10254, but be excellent to close this at the same time

Unless there are extreme objections, I would prefer to do this in a different PR. HOWEVER, given that reply i would make sure that the next iteration would already pass the appropriate information (esp. the inet family) in function arguments internally so that code churn in sd-netlink is minimized. Does that sound reasonable?

@poettering
Copy link
Member

I think that should be done in an extra request. I'm also not sure this should be changed, the behaviour is there for some time.
e.g. what if someone has a service listening on 127.0.0.1:12345 and there is a port forward for that from external:12345 to internal:42 ? If that is resolved then after update connect to 127.0.0.1:12345 ends up in the container.

That's pretty unlikely, no? Might be relevant enough to mention in NEWS, but maybe too unlikely to make any bigger fuss about? Dunno?

Unless there are extreme objections, I would prefer to do this in a different PR. HOWEVER, given that reply i would make sure that the next iteration would already pass the appropriate information (esp. the inet family) in function arguments internally so that code churn in sd-netlink is minimized. Does that sound reasonable?

Doing that in a separate PR is entirely OK. I am thankful for every issue closed, but it's entirely up to you if you want to do that, and when, and in how many PRs!

@yuwata
Copy link
Member

yuwata commented Sep 16, 2020

Not reviewed yet. But please rebase.

@poettering poettering added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Sep 16, 2020
@fw-strlen
Copy link
Author

Not reviewed yet. But please rebase.

I've rebased and pushed my current working tree. Its till not 100% complete, I will work on this tomorrow.
I've added two more commits, one commit to add a 'fw context' to avoid the need to keep a static nfnetlink handle in the nft backend and another one to rework the way ip_tables module gets loaded. The other commits have been changed mostly to address bugs and code format fixups, they do not have major changes. I will push another version+comment once i think all reported problems have been resolved.

@fw-strlen fw-strlen force-pushed the nft_16 branch 2 times, most recently from 68cc1d8 to 4eaf4db Compare September 18, 2020 21:03
@fw-strlen
Copy link
Author

BTW, the ipv6 support is being asked for in #10254, but be excellent to close this at the same time

Unless there are extreme objections, I would prefer to do this in a different PR. HOWEVER, given that reply i would make sure that the next iteration would already pass the appropriate information (esp. the inet family) in function arguments internally so that code churn in sd-netlink is minimized. Does that sound reasonable?

FWIW I have a (small) patch that adds ipv6 support. I would make another PR after this one is merged if thats OK.
Main issue is that it doesn't do anything yet -- at this time all callers only pass ipv4 addresses.
Do you have any preference on how to change networkd and nspawn? I.e., should there be new options to enable ipv6 explicitly ("IPV6Masquerade=true") or similar? No hurry, but something to keep in mind.

@fw-strlen
Copy link
Author

regarding the overlap issue: if people set things up that way things are broken a bit, no? maybe leave it that way hence, but propagate a recognizable error up (EUCLEAN?) and make networkd/nspawn print out a friendly explanation about it

In case this would happen networkd would trigger this error message (already in master):
r = address_establish(address, link);
if (r < 0)
log_link_warning_errno(link, r, "Could not enable IP masquerading, ignoring: %m");
(errno will be EEXIST).

@fw-strlen
Copy link
Author

I think you can risk a second look at this pr (head is 4eaf4db, "src: remove remaining libiptc entanglemen"). Feedback welcome, will try to address further change requests in a timely fashion.

@poettering
Copy link
Member

"IPV6Masquerade=true

In various cases we have "per-proto bools", i.e. where Masquerade=no means off, =yes means on on both ipv4 and ipv6, and =ipv4 means ipv4-only an =ipv6 means ipv6-only.

@fw-strlen fw-strlen force-pushed the nft_16 branch 2 times, most recently from 53bc3ea to 6afe21b Compare September 21, 2020 21:37
@bluca
Copy link
Member

bluca commented Dec 14, 2020

Needs another rebase

@fw-strlen
Copy link
Author

Rebased, was only a minor context clash with 88fc9c9.

assert(prefixlen <= 32);
nplen = 32 - prefixlen;

mask = (1U << nplen) - 1U;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if npen is allowed to be 32 (i.e. prefixlen is allowed to be zero), then this will result in undefined behaviour, since 1U << 32 is not defined, and will overflow.

if prefixlen is now allowed to be zero, then maybe error out early?

(it looks to me as if a full ip range should be ok to support here though, hence maybe special case prefixlen == 0 here and allow it?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not allowed anymore, this will yield -EINVAL earlier.
The reason is two-fold:

  1. The caller is networkd, and afaics it always passes in the netmask configured on the interface.
  2. The existing iptables backend (which the nft one is modeled on) doesn't restrict based on interface. IOW, when using /0, it will masquerade everything on any interface, which is probably not wanted. If this should be supported, it might be best to add extra special code that just deletes the existing rule (which checks if a given address is in a range we're masquerading for) and replaces that with a single masquerade rule with no further checks.


out_unref:
while (tsize > 0)
sd_netlink_message_unref(transaction[--tsize]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in similar cases we added foobar_unref_many() wrappers, which take two params: the array and the size, and it goes through all entries and unrefs them. in this file you already have this loop twice, might be worth adding here too

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reluctant to add/expose a new sd-netlink api for this, but i can surely add it if thats wanted.

@poettering
Copy link
Member

looks great, only minor stuff left that we can ignore – except for the netmask overflow thing which looks to me as if it is relevant and should be fixed.

Florian Westphal added 9 commits December 16, 2020 00:35
Make sure we don't add masquerading rules without a explicitly
specified network range we should be masquerading for.

The only caller aside from test case is
networkd-address.c which never passes a NULL source.

As it also passes the network prefix, that should always be > 0 as well.

This causes expected test failure:
Failed to modify firewall: Invalid argument
Failed to modify firewall: Invalid argument
Failed to modify firewall: Invalid argument
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available

The failing test cases are amended to expect failure on
NULL source or prefix instead of success.
In a nutshell:
1. git mv firewall-util.c firewall-util-iptables.c
2. existing external functions gain _iptables_ in their names
3. firewall-util.c provides old function names
4. build system always compiles firewall-util.c,
   firewall-util-iptables.c is conditional instead (libiptc).
5. On first call to any of the 'old' API functions performs
   a probe that should return the preferred backend.

In a future step, can add firewall-util-FOOTYPE.c, add its
probe function to firewall-util.c and then have calls to
fw_add_masq/local_dnat handed to the detected backend.

For now, only iptables backend exists, and no special probing
takes place for it, i.e. when systemd was built with iptables,
that will be used.  If not, requets to add masquerade/dnat will
fail with same error (-EOPNOTSUPP) as before this change.

For reference, the rules added by the libiptc/iptables backend look like this:

for service export (via systemd-nspawn):
[0:0] -A PREROUTING -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port
[0:0] -A OUTPUT ! -d 127.0.0.0/8 -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port

for ip masquerade:
[0:0] -A POSTROUTING -s network/prefix -j MASQUERADE
Next patch will need to pass two pointers to the callback instead
of just the addr mask.  Caller will pass a compound structure, so
make this 'void *userdata' to de-clutter the next patch.
for planned nft backend we have three choices:

- open/close a new nfnetlink socket for every operation
- keep a nfnetlink socket open internally
- expose a opaque fw_ctx and stash all internal data here.

Originally I opted for the 2nd option, but during review it was
suggested to avoid static storage duration because of perceived
problems with threaded applications.

This adds fw_ctx and new/free functions, then converts the existing api
and nspawn and networkd to use it.
Will be used/needed in the upcoming nfnetlink/nftables support.
This follows existing model where kernel uapi headers are cached
locally.
Will be used by upcoming nftables support -- it will use the netlink
interface directly rather than add another library dependency.
add nfnetlink_nftables helper functions to:
 * open a new nfnetlink socket to kernel
 * add tables, chains, rules, sets and maps
 * delete/flush table
 * add and delete elements from sets/maps
nftables uses a transaction-based netlink model: one netlink write
comes with multiple messages.

A 'BEGIN' message to tell nf_tables/kernel that a new transaction starts.

Then, one more messages to add/delete tables/chains/rules etc.

Lastly, an END message that commits all changes.

This function will be used to send all the individual messages that should
make up a single transaction as a single write.
Will be used by nftables nfnetlink backend.
It sends a series of netlink messages that form a nftables
update transaction.

The transaction will then generate a series of ack messages
(or an error).

This function will be used to read these acks.
Florian Westphal added 2 commits December 16, 2020 01:07
Idea is to use a static ruleset, added when the first attempt to
add a masquerade or dnat rule is made.

The alternative would be to add the ruleset when the init function is called.
The disadvantage is that this enables connection tracking and NAT in the kernel
(as the ruleset needs this to work), which comes with some overhead that might
not be needed (no nspawn usage and no IPMasquerade option set).

There is no additional dependency on the 'nft' userspace binary or other libraries.
sd-netlinks nfnetlink backend is used to modify the nftables ruleset.

The commit message/comments still use nft syntax since that is what
users will see when they use the nft tool to list the ruleset.

The added initial skeleton (added on first fw_add_masquerade/local_dnat
call) looks like this:

table ip io.systemd.nat {
        set masq_saddr {
                type ipv4_addr
                flags interval
                elements = { 192.168.59.160/28 }
        }

        map map_port_ipport {
                type inet_proto . inet_service : ipv4_addr . inet_service
                elements = { tcp . 2222 : 192.168.59.169 . 22 }
        }

        chain prerouting {
                type nat hook prerouting priority dstnat + 1; policy accept;
                fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
        }

        chain output {
                type nat hook output priority -99; policy accept;
                ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
        }

        chain postrouting {
                type nat hook postrouting priority srcnat + 1; policy accept;
                ip saddr @masq_saddr masquerade
        }
}

Next calls to fw_add_masquerade/add_local_dnat will then only add/delete the
element/mapping to masq_saddr and map_port_ipport, i.e. the ruleset doesn't
change -- only the set/map content does.

Running test-firewall-util with this backend gives following output
on a parallel 'nft monitor':

$ nft monitor
add table ip io.systemd.nat
add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
CTRL-C

Things not implemented/supported:
1. Change monitoring.  The kernel allows userspace to learn about changes
   made by other clients (using nfnetlink notifications). It would be
   possible to detect when e.g. someone removes the systemd nat table.
   This would need more work.  Its also not clear on how to react to
   external changes -- it doesn't seem like a good idea to just auto-undo
   everthing.
2. 'set masq_saddr' doesn't handle overlaps.
   Example:

   fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
   fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */

With the iptables backend the second call works, as it adds an
independent iptables rule.

With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with
the existing range of 10.0.0.0-10.0.255.255 so 2nd add gets rejected by the
kernel.

This will generate an error message from networkd ("Could not enable IP
masquerading: File exists").

To resolve this it would be needed to either keep track of the added elements
and perform range merging when overlaps are detected.

However, the add erquests are done using the configured network on a
device, so no overlaps should occur in normal setups.

IPv6 support is added in a extra changeset.

Fixes: systemd#13307
When someone runs 'nft flush ruleset' in the same net namespace
this will also tear down the systemd nat table.

Unlike iptables -t nat -F, which will remove all rules added by
the systemd iptables backend, iptables has builtin chains that cannot
be deleted. IOW, the next add operation will 'just work'.

In the nftables case however, the entire table gets removed.

When the systemd nat table is removed by an external entity next
attempt to add a set element will yield -ENOENT.

If this happens, recreate the table, and, if successful, re-do
the add operation.

Note that this doesn't protect against external sabotage such as
a running 'while true; nft flush ruleset;done'. However, there is
nothing that could be done short of extending the kernel to allow
tables to be "frozen" or otherwise tied to a process such as
systemd-networkd.
@fw-strlen
Copy link
Author

current incarnation (bc5a9b8) has following changes compared to last reviewed one:

  1. rebase on master (no conflicts)
  2. remove two blank lines
  3. add commit message of topmost commit also as comment as per Lennarts suggestion

I did not (yet?) add the new _unreg_many helper and did not special case (allow) 0 prefixlen so far.

@poettering
Copy link
Member

let's merge this, the rest we can figure out and tweak after the merge.

excellent work! thank you!

@poettering poettering merged commit a8af734 into systemd:master Dec 16, 2020
@daandemeyer
Copy link
Collaborator

daandemeyer commented Dec 24, 2020

I think that should be done in an extra request. I'm also not sure this should be changed, the behaviour is there for some time.
e.g. what if someone has a service listening on 127.0.0.1:12345 and there is a port forward for that from external:12345 to internal:42 ? If that is resolved then after update connect to 127.0.0.1:12345 ends up in the container.

@fw-strlen If I wanted to make the necessary changes to make this work, what changes would be required? I tried removing the ip daddr != 127.0.0.1/8 rule from the output chain but that's not enough it seems. I then tried to enable net.ipv4.conf.ve-image.route_localnet=1 for the container's veth link host interface but that didn't do it either. I did go from a hanging ssh command without disabling route_localnet to getting ssh: connect to host localhost port 8729: No route to host when disabling it.

EDIT: nvm, I was missing a masquerade rule on traffic coming from localhost

@nikstur nikstur mentioned this pull request Aug 9, 2025
13 tasks
nikstur added a commit to nikstur/nixpkgs that referenced this pull request Aug 10, 2025
Since systemd/systemd#17026 (v248 in 2020),
systemd can use nftables without any new dependency!

In 259, systemd plans to remove iptables suport altogether.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Port from iptables (libiptc) to nftables (libnftl or libnftables)

6 participants