add networkd/nspawn nftables backend#17026
Conversation
|
This pull request introduces 1 alert when merging f570f4a into e67b818 - view on LGTM.com new alerts:
|
|
I plan to fix the reported errors on Monday. I'll also do a build test on a Centos7 machine to make sure it builds there as well. |
poettering
left a comment
There was a problem hiding this comment.
would love a test case for this (even if it's a manual one, given that the CIs might not have nft), but i guess we had none before...
|
regarding the overlap issue: if people set things up that way things are broken a bit, no? maybe leave it that way hence, but propagate a recognizable error up (EUCLEAN?) and make networkd/nspawn print out a friendly explanation about it |
|
would love ipv6 support for this. for the nspawn case I am pretty sure it makes a ton of sense to hide the fact that an nspawn container is used in the back... |
|
regarding multiple messages in netlink support, i figure @yuwata knows this better by now than me... |
|
BTW, the ipv6 support is being asked for in #10254, but be excellent to close this at the same time |
|
Hmm, does this address #6106 too? if not, maybe it should? |
I think that should be done in an extra request. I'm also not sure this should be changed, the behaviour is there for some time. |
Unless there are extreme objections, I would prefer to do this in a different PR. HOWEVER, given that reply i would make sure that the next iteration would already pass the appropriate information (esp. the inet family) in function arguments internally so that code churn in sd-netlink is minimized. Does that sound reasonable? |
That's pretty unlikely, no? Might be relevant enough to mention in NEWS, but maybe too unlikely to make any bigger fuss about? Dunno?
Doing that in a separate PR is entirely OK. I am thankful for every issue closed, but it's entirely up to you if you want to do that, and when, and in how many PRs! |
|
Not reviewed yet. But please rebase. |
I've rebased and pushed my current working tree. Its till not 100% complete, I will work on this tomorrow. |
68cc1d8 to
4eaf4db
Compare
FWIW I have a (small) patch that adds ipv6 support. I would make another PR after this one is merged if thats OK. |
In case this would happen networkd would trigger this error message (already in master): |
|
I think you can risk a second look at this pr (head is 4eaf4db, "src: remove remaining libiptc entanglemen"). Feedback welcome, will try to address further change requests in a timely fashion. |
In various cases we have "per-proto bools", i.e. where Masquerade=no means off, =yes means on on both ipv4 and ipv6, and =ipv4 means ipv4-only an =ipv6 means ipv6-only. |
53bc3ea to
6afe21b
Compare
|
Needs another rebase |
|
Rebased, was only a minor context clash with 88fc9c9. |
| assert(prefixlen <= 32); | ||
| nplen = 32 - prefixlen; | ||
|
|
||
| mask = (1U << nplen) - 1U; |
There was a problem hiding this comment.
if npen is allowed to be 32 (i.e. prefixlen is allowed to be zero), then this will result in undefined behaviour, since 1U << 32 is not defined, and will overflow.
if prefixlen is now allowed to be zero, then maybe error out early?
(it looks to me as if a full ip range should be ok to support here though, hence maybe special case prefixlen == 0 here and allow it?)
There was a problem hiding this comment.
Its not allowed anymore, this will yield -EINVAL earlier.
The reason is two-fold:
- The caller is networkd, and afaics it always passes in the netmask configured on the interface.
- The existing iptables backend (which the nft one is modeled on) doesn't restrict based on interface. IOW, when using /0, it will masquerade everything on any interface, which is probably not wanted. If this should be supported, it might be best to add extra special code that just deletes the existing rule (which checks if a given address is in a range we're masquerading for) and replaces that with a single masquerade rule with no further checks.
|
|
||
| out_unref: | ||
| while (tsize > 0) | ||
| sd_netlink_message_unref(transaction[--tsize]); |
There was a problem hiding this comment.
in similar cases we added foobar_unref_many() wrappers, which take two params: the array and the size, and it goes through all entries and unrefs them. in this file you already have this loop twice, might be worth adding here too
There was a problem hiding this comment.
I was reluctant to add/expose a new sd-netlink api for this, but i can surely add it if thats wanted.
|
looks great, only minor stuff left that we can ignore – except for the netmask overflow thing which looks to me as if it is relevant and should be fixed. |
Make sure we don't add masquerading rules without a explicitly specified network range we should be masquerading for. The only caller aside from test case is networkd-address.c which never passes a NULL source. As it also passes the network prefix, that should always be > 0 as well. This causes expected test failure: Failed to modify firewall: Invalid argument Failed to modify firewall: Invalid argument Failed to modify firewall: Invalid argument Failed to modify firewall: Protocol not available Failed to modify firewall: Protocol not available Failed to modify firewall: Protocol not available Failed to modify firewall: Protocol not available The failing test cases are amended to expect failure on NULL source or prefix instead of success.
In a nutshell: 1. git mv firewall-util.c firewall-util-iptables.c 2. existing external functions gain _iptables_ in their names 3. firewall-util.c provides old function names 4. build system always compiles firewall-util.c, firewall-util-iptables.c is conditional instead (libiptc). 5. On first call to any of the 'old' API functions performs a probe that should return the preferred backend. In a future step, can add firewall-util-FOOTYPE.c, add its probe function to firewall-util.c and then have calls to fw_add_masq/local_dnat handed to the detected backend. For now, only iptables backend exists, and no special probing takes place for it, i.e. when systemd was built with iptables, that will be used. If not, requets to add masquerade/dnat will fail with same error (-EOPNOTSUPP) as before this change. For reference, the rules added by the libiptc/iptables backend look like this: for service export (via systemd-nspawn): [0:0] -A PREROUTING -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port [0:0] -A OUTPUT ! -d 127.0.0.0/8 -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port for ip masquerade: [0:0] -A POSTROUTING -s network/prefix -j MASQUERADE
Next patch will need to pass two pointers to the callback instead of just the addr mask. Caller will pass a compound structure, so make this 'void *userdata' to de-clutter the next patch.
for planned nft backend we have three choices: - open/close a new nfnetlink socket for every operation - keep a nfnetlink socket open internally - expose a opaque fw_ctx and stash all internal data here. Originally I opted for the 2nd option, but during review it was suggested to avoid static storage duration because of perceived problems with threaded applications. This adds fw_ctx and new/free functions, then converts the existing api and nspawn and networkd to use it.
Will be used/needed in the upcoming nfnetlink/nftables support. This follows existing model where kernel uapi headers are cached locally.
Will be used by upcoming nftables support -- it will use the netlink interface directly rather than add another library dependency.
add nfnetlink_nftables helper functions to: * open a new nfnetlink socket to kernel * add tables, chains, rules, sets and maps * delete/flush table * add and delete elements from sets/maps
nftables uses a transaction-based netlink model: one netlink write comes with multiple messages. A 'BEGIN' message to tell nf_tables/kernel that a new transaction starts. Then, one more messages to add/delete tables/chains/rules etc. Lastly, an END message that commits all changes. This function will be used to send all the individual messages that should make up a single transaction as a single write.
Will be used by nftables nfnetlink backend. It sends a series of netlink messages that form a nftables update transaction. The transaction will then generate a series of ack messages (or an error). This function will be used to read these acks.
Idea is to use a static ruleset, added when the first attempt to
add a masquerade or dnat rule is made.
The alternative would be to add the ruleset when the init function is called.
The disadvantage is that this enables connection tracking and NAT in the kernel
(as the ruleset needs this to work), which comes with some overhead that might
not be needed (no nspawn usage and no IPMasquerade option set).
There is no additional dependency on the 'nft' userspace binary or other libraries.
sd-netlinks nfnetlink backend is used to modify the nftables ruleset.
The commit message/comments still use nft syntax since that is what
users will see when they use the nft tool to list the ruleset.
The added initial skeleton (added on first fw_add_masquerade/local_dnat
call) looks like this:
table ip io.systemd.nat {
set masq_saddr {
type ipv4_addr
flags interval
elements = { 192.168.59.160/28 }
}
map map_port_ipport {
type inet_proto . inet_service : ipv4_addr . inet_service
elements = { tcp . 2222 : 192.168.59.169 . 22 }
}
chain prerouting {
type nat hook prerouting priority dstnat + 1; policy accept;
fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
}
chain output {
type nat hook output priority -99; policy accept;
ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
}
chain postrouting {
type nat hook postrouting priority srcnat + 1; policy accept;
ip saddr @masq_saddr masquerade
}
}
Next calls to fw_add_masquerade/add_local_dnat will then only add/delete the
element/mapping to masq_saddr and map_port_ipport, i.e. the ruleset doesn't
change -- only the set/map content does.
Running test-firewall-util with this backend gives following output
on a parallel 'nft monitor':
$ nft monitor
add table ip io.systemd.nat
add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
CTRL-C
Things not implemented/supported:
1. Change monitoring. The kernel allows userspace to learn about changes
made by other clients (using nfnetlink notifications). It would be
possible to detect when e.g. someone removes the systemd nat table.
This would need more work. Its also not clear on how to react to
external changes -- it doesn't seem like a good idea to just auto-undo
everthing.
2. 'set masq_saddr' doesn't handle overlaps.
Example:
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */
With the iptables backend the second call works, as it adds an
independent iptables rule.
With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with
the existing range of 10.0.0.0-10.0.255.255 so 2nd add gets rejected by the
kernel.
This will generate an error message from networkd ("Could not enable IP
masquerading: File exists").
To resolve this it would be needed to either keep track of the added elements
and perform range merging when overlaps are detected.
However, the add erquests are done using the configured network on a
device, so no overlaps should occur in normal setups.
IPv6 support is added in a extra changeset.
Fixes: systemd#13307
When someone runs 'nft flush ruleset' in the same net namespace this will also tear down the systemd nat table. Unlike iptables -t nat -F, which will remove all rules added by the systemd iptables backend, iptables has builtin chains that cannot be deleted. IOW, the next add operation will 'just work'. In the nftables case however, the entire table gets removed. When the systemd nat table is removed by an external entity next attempt to add a set element will yield -ENOENT. If this happens, recreate the table, and, if successful, re-do the add operation. Note that this doesn't protect against external sabotage such as a running 'while true; nft flush ruleset;done'. However, there is nothing that could be done short of extending the kernel to allow tables to be "frozen" or otherwise tied to a process such as systemd-networkd.
|
current incarnation (bc5a9b8) has following changes compared to last reviewed one:
I did not (yet?) add the new _unreg_many helper and did not special case (allow) 0 prefixlen so far. |
|
let's merge this, the rest we can figure out and tweak after the merge. excellent work! thank you! |
@fw-strlen If I wanted to make the necessary changes to make this work, what changes would be required? I tried removing the EDIT: nvm, I was missing a masquerade rule on traffic coming from localhost |
Since systemd/systemd#17026 (v248 in 2020), systemd can use nftables without any new dependency! In 259, systemd plans to remove iptables suport altogether.
This adds nftables/nfnetlink support to sd-netlink, then builds nftables support for firewall-util-nft backend on top of it.
See #13307 for the existing issue filed for this.
No new library dependencies are added. First patches streamline existing API by removing 'always-NULL' arguments that are not used. Then, existing firewall-util gets renamed so that it becomes possible to implement an alternative backend
to libiptc.
Next patches extend sd-netlink with a minimal nfnetlink backend.
It only supports the subset of nftables required to make masquerade/dnat work, i.e. what libiptc backend provides.
Last patch adds the nftables backend. Iindividual commit messages have more details.
Tested:
test-firewall-util works
VM that hosts a nspawn-container will
add expected dnat and masquerade rules:
$ nft monitor
[ container is started now ]
add table ip io.systemd.nat
add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
new generation 2 by process 892 (systemd-network)
add element ip io.systemd.nat masq_saddr { 192.168.50.112/28 }
new generation 3 by process 892 (systemd-network)
add element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.50.121 . 22 }
new generation 4 by process 1404 (systemd-nspawn)
delete element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.50.121 . 22 }
[ container was shut down, then started again ]
new generation 5 by process 1404 (systemd-nspawn)
add element ip io.systemd.nat masq_saddr { 192.168.202.240-255.255.255.255 }
new generation 6 by process 892 (systemd-network)
add element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.202.249 . 22 }
new generation 7 by process 1492 (systemd-nspawn)
delete element ip io.systemd.nat map_port_ipport { tcp . 2222 : 192.168.202.249 . 22 }
new generation 8 by process 1492 (systemd-nspawn)
CTRL-C
Tested that I can connect from external machine to public-ip:2222 and get NAT'd to the container. Same works from within the VM to container. Using 127.0.0.1 does not work (just like with existing iptables backend).
Repeating same test with kernel that lacks support for nftables dnat results in addition of iptables rules.
Caveats:
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */
With the iptables backend the second call works; it will just add another iptables rule.
With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with the existing range of 10.0.0.0-10.0.255.255: 2nd add gets rejected by the kernel. To resolve this it would be needed to keep track of the added elements and perform range merging when overlaps are detected. If this is needed I can add this.
If there is a more appropriate existing API I should use please let me know.