nspawn: R/W support for /sysfs, /proc, and /proc/sys/net#4395
nspawn: R/W support for /sysfs, /proc, and /proc/sys/net#4395poettering merged 2 commits intosystemd:masterfrom
Conversation
| { NULL, "/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_RO }, | ||
| { "tmpfs", "/dev", "tmpfs", "mode=755", MS_NOSUID|MS_STRICTATIME, MOUNT_FATAL }, | ||
| { "tmpfs", "/dev/shm", "tmpfs", "mode=1777", MS_NOSUID|MS_NODEV|MS_STRICTATIME, MOUNT_FATAL }, | ||
| { "tmpfs", "/run", "tmpfs", "mode=755", MS_NOSUID|MS_NODEV|MS_STRICTATIME, MOUNT_FATAL }, |
There was a problem hiding this comment.
This mount table looked, and looks pretty complicated, but I didn't have a better idea on how to refactor this to be more readable.
|
/cc @lucab |
poettering
left a comment
There was a problem hiding this comment.
looks ok conceptually, just a few more chnages please!
src/nspawn/nspawn-mount.h
Outdated
| typedef enum MountSettingsMask { | ||
| MOUNT_FATAL = 1 << 0, | ||
| MOUNT_IN_USERNS = 1 << 1, | ||
| MOUNT_USE_USERNS = 1 << 3, |
There was a problem hiding this comment.
if we have these nice flags, I think we should also cease the chance and document them in a nice way, i.e. add some /* */ explanation here to the end of relevant flags. I figure most flags are pretty much self-explanatory, but I think the namespacing-related options could really use a brief explanation, i.e. what the difference between "IN_USERNS" and "USE_USERNS" actually is...
src/nspawn/nspawn-mount.h
Outdated
| MOUNT_IN_USERNS = 1 << 1, | ||
| MOUNT_USE_USERNS = 1 << 3, | ||
| MOUNT_USE_NETNS = 1 << 2, | ||
| MOUNT_RO = 1 << 4, |
There was a problem hiding this comment.
Hmm, the RO flag is a bit misleadingly named, no? after all it doesn't control read-only vs. writable. Instead it is simply a mask for the actual RO bit used. As such, it should probably named MASK_RO instead of just RO, no?
(or did I misunderstand this?)
There was a problem hiding this comment.
Semantically it behaves like MOUNT_USE_NETNS omitting those mounts if unset. So how about if we call both MOUNT_MASK_NETNS, and MOUNT_MASK_RO?
src/nspawn/nspawn.c
Outdated
| parse_share_ns_env("SYSTEMD_NSPAWN_SHARE_SYSTEM", CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS); | ||
|
|
||
| if (getenv_bool("SYSTEMD_NSPAWN_MOUNT_RW") <= 0) | ||
| mount_settings |= MOUNT_RO; |
There was a problem hiding this comment.
I'd probably turn this around, and set mount_settings to MOUNT_RO when declaring the variable (instead of setting it to 0 then), and then remove the bit, if the env var here is set. I think it's a good idea to have the static arg_xyz variables all initialized to their actual defaults by default, in order to make things more obvious to readers.
src/nspawn/nspawn.c
Outdated
| static bool arg_notify_ready = false; | ||
| static bool arg_use_cgns = true; | ||
| static unsigned long arg_clone_ns_flags = CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS; | ||
| static MountSettingsMask mount_settings = 0; |
There was a problem hiding this comment.
These global, static variables that directly relate to cmdline arguments (or in some case, including this one, to env vars), should all be prefixed with the "arg_" prefix, to indicate that their are process arguments instead of internal state. We followed that pretty strictly so far in our tree, and should do so here, I think
src/nspawn/nspawn-mount.c
Outdated
| bool in_userns = (mount_table[k].mount_settings & MOUNT_IN_USERNS) > 0; | ||
| bool use_userns = (mount_table[k].mount_settings & MOUNT_USE_USERNS) > 0; | ||
|
|
||
| if (((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS) > 0) |
There was a problem hiding this comment.
why the > 0 check here? you can just drop that and rely on C's downgrade-to-bool feature here. After all, this is actually a bool-like expression that you are using here, hence using it as a bool if you can makes a ton of sense.
src/nspawn/nspawn-mount.c
Outdated
| o = mount_table[k].options; | ||
| if (streq_ptr(mount_table[k].type, "tmpfs")) { | ||
| if (in_userns) | ||
| if (in_userns > 0) |
There was a problem hiding this comment.
in_userns is a bool, it's strange doing a greater-than comparison with bools...
src/nspawn/nspawn-mount.c
Outdated
| remount_flags |= MS_RDONLY; | ||
| } | ||
|
|
||
| r = mount_verbose(LOG_ERR, "sysfs", full, "sysfs", sysfs_flags, NULL); |
There was a problem hiding this comment.
I'd really do this the other way round: leave the literal flags parameter being passed to mount_verbose() in place, but add an |extra to the ORed flags as last step. Then, simply define extra as long, and initialize it based on the MOUNT_RO flag, and reuse it for both cases. Such a change would be much shorter, and only needs one extra var, instead of two...
src/nspawn/nspawn.c
Outdated
| mount_settings |= MOUNT_USE_NETNS; | ||
|
|
||
| if (getenv_bool("SYSTEMD_NSPAWN_USE_NETNS") <= 0) | ||
| mount_settings |= MOUNT_USE_NETNS; |
There was a problem hiding this comment.
not sure I get this one. What's the precise usecase here? Why wouldn't one use --private-network here?
There was a problem hiding this comment.
--private-network creates its own network namespace, which we would like to omit here. The use case is rkt+an external network configurator (e.g. CNI), where we already have an external network namespace, join it, and have /proc/sys/net read-write.
There was a problem hiding this comment.
so, what precisely is left of --private-network's functionality if you use $SYSTEMD_NSPAWN_USE_NETNS? Only the fact that /proc/sys/net remains writable?
So, summarizing your patch set with $SYSTEMD_NSPAWN_MOUNT_RW you get all of /proc/, /sys writable, but with $SYSTEMD_NSPAWN_USE_NETNS you get only /proc/sys/net writable but everything else still read-only? If so, the latter option is a true subset of the former? If so, maybe this should not be two bool env vars, but instead one that takes an enum? Maybe $SYSTEMD_API_VFS_WRITABLE=auto|yes|no|network or so?
There was a problem hiding this comment.
SYSTEMD_API_VFS_WRITABLE sounds good to me, I'll head this route.
|
@poettering thanks for the swift review! I'll address the comments for the next round (and apparently something broke my patch in the last 4 days) |
|
Xenial CentOS BTW: there is a smoke test for systemd-nspawn: #4378
I think we should add |
|
@evverx Thanks for the test results, I made indeed a mistake enabling I am fine with #4378 merging first, and will try to add a smoke test for |
Feel free to ping me (but, actually, I don't know how to pass that smoke-test: #4372 (comment) :-)) |
398f610 to
3dd3bc9
Compare
src/nspawn/nspawn.c
Outdated
| else if (streq(e, "no")) | ||
| arg_mount_settings |= MOUNT_MASK_RO; | ||
| else if (streq(e, "network")) | ||
| arg_mount_settings |= MOUNT_MASK_NETNS; |
There was a problem hiding this comment.
So, I think we should use the "extended bool" logic to parse this, which we frequently use in other cases like this. Specifically, this means checking first if the parameter has the "network" value. And if not, then fall back to parsing it as a boolean via parse_boolean(). That way, people can treat it as a usual boolean (and also use "0", "off" and "false" as negative value, and "1", "on", "true" as positive value), except that it understands one more, special value, which is "network".
I also think this should generate a warning message, when we don't understand the passed parameter. i.e. use log_warning() and say that the value set in the env var is not understood and will be ignored.
src/nspawn/nspawn.c
Outdated
| return; | ||
|
|
||
| if (streq(e, "yes")) | ||
| arg_mount_settings &= ~MOUNT_MASK_RO; |
There was a problem hiding this comment.
this should not only unset MOUNT_MASK_RO, but also MOUNT_MASK_NETNS
There was a problem hiding this comment.
yes, indeed, that's a good idea.
src/nspawn/nspawn.c
Outdated
| if (streq(e, "yes")) | ||
| arg_mount_settings &= ~MOUNT_MASK_RO; | ||
| else if (streq(e, "no")) | ||
| arg_mount_settings |= MOUNT_MASK_RO; |
There was a problem hiding this comment.
this should not only set MOUNT_MASK_RO, but also unset MOUNT_MASK_NETNS, no?
src/nspawn/nspawn-mount.h
Outdated
| MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */ | ||
| MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */ | ||
| MOUNT_MASK_NETNS = 1 << 3, /* masks mounts to be executed in a network namespace in addition to all other mounts */ | ||
| MOUNT_MASK_RO = 1 << 4, /* masks mounts to be executed read-only in addition to all other mounts */ |
There was a problem hiding this comment.
i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.
If i got this right, the MASK_RO set means "/proc/sys is read-only", correct? And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?
Thinking about this I have the suspicion that "mask" is actually a weird word here and we really shouldn't use it. After all, we use these flags both for the "mask" that is set in the env var, but also for the flags in the mount table. And one is a mask, the other a condition i would say...
Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.
sorry for the confusoin around this...
There was a problem hiding this comment.
i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.
Yes, they are turning features "on", I'll enhance the comments.
If i got this right, the MASK_RO set means "/proc/sys is read-only", correct?
yes
And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?
no, that means "/proc/sys/net" will remain read-write.
Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.
agreed
sorry for the confusoin around this...
No worries :-)
systemd-nspawn -D ./cont -U -b
...
Selected user namespace base 1605697536 and range 65536.
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to write boot id: Permission deniedSeems like the outer child doesn't patch tmpfs options: Should be something like |
|
@evverx gah, thanks, already spotted my mistake. |
3dd3bc9 to
d824fdd
Compare
src/nspawn/nspawn-mount.c
Outdated
| int mount_sysfs(const char *dest, MountSettingsMask mount_settings) { | ||
| const char *full, *top, *x; | ||
| int r; | ||
| long extra_flags = 0; |
| { NULL, "/proc/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* ... then, make it r/o */ | ||
| { "/proc/sysrq-trigger", "/proc/sysrq-trigger", NULL, NULL, MS_BIND, MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* Bind mount first ...*/ | ||
| { NULL, "/proc/sysrq-trigger", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* ... then, make it r/o */ | ||
| { "tmpfs", "/tmp", "tmpfs", "mode=1777", MS_STRICTATIME, MOUNT_FATAL|MOUNT_IN_USERNS }, |
There was a problem hiding this comment.
Thanks, this table is finally legible.
src/nspawn/nspawn-mount.c
Outdated
| const char *o; | ||
| bool fatal = (mount_table[k].mount_settings & MOUNT_FATAL); | ||
|
|
||
| if ((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS) |
There was a problem hiding this comment.
Dunno, I don't find those bitmask manipulations very clear. I think it would be easier to read if this was written as
if ((mount_settings & MOUNT_IN_USERNS) != (mount_table[k].mount_settings & MOUNT_IN_USERNS))
|
|
||
| if (!use_netns && mount_table[k].use_netns) | ||
| if (~mount_settings & mount_table[k].mount_settings & MOUNT_APPLY_APIVFS_RO) | ||
| continue; |
There was a problem hiding this comment.
A helper variable would make the intent clear:
bool leftover_flags = ~mount_settings & mount_table[k].mount_settings;
if (leftover_flags & MOUNT_APPLY_APIVFS_NETNS ||
leftover_flags & MOUNT_APPLY_APIVFS_RO)
continue;
src/nspawn/nspawn-mount.h
Outdated
| #include "cgroup-util.h" | ||
|
|
||
| typedef enum MountSettingsMask { | ||
| MOUNT_FATAL = 1 << 0, /* if set, a mount error are considered fatal */ |
src/nspawn/nspawn-mount.h
Outdated
| MOUNT_FATAL = 1 << 0, /* if set, a mount error are considered fatal */ | ||
| MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */ | ||
| MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */ | ||
| MOUNT_APPLY_APIVFS_RO = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */ |
src/nspawn/nspawn-mount.h
Outdated
| MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */ | ||
| MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */ | ||
| MOUNT_APPLY_APIVFS_RO = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */ | ||
| MOUNT_APPLY_APIVFS_NETNS = 1 << 4, /* if set, /proc/sys/net will be mounted read-write. Works only if MOUNT_APPLY_APIVFS_NETNS is also set. */ |
src/nspawn/nspawn.c
Outdated
| arg_clone_ns_flags = (arg_clone_ns_flags & ~ns_flag) | (r > 0 ? 0 : ns_flag); | ||
| } | ||
|
|
||
|
|
src/nspawn/nspawn.c
Outdated
| return; | ||
| } | ||
|
|
||
| r = getenv_bool("SYSTEMD_NSPAWN_API_VFS_WRITABLE"); |
There was a problem hiding this comment.
It's a minor thing, but querying the envvar twice is ugly. Use
r = parse_boolean(e);
instead.
src/nspawn/nspawn.c
Outdated
| arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS; | ||
| } else { | ||
| arg_mount_settings |= MOUNT_APPLY_APIVFS_RO; | ||
| arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS; |
There was a problem hiding this comment.
Is this intentational that lines 403 and 406 are the same? If yes, it should be moved below the else block.
|
@keszybz thanks for the review! I'll address the comments in the next round. |
d824fdd to
c963254
Compare
|
@s-urbaniak , I think some combinations doesn't make sense, right? Shouldn't we call the |
|
@evverx hmm, I need to have a more detailed look into this (, but won't make until the beginning of the next week.) |
|
@evverx hmm, I can't reproduce this, I am recompiling my kernel now with user-namespace support, but at least without
works for me. I am apparently missing something, but I don't see a connection between write-only Can you provide me with a little more details on a setup that is reproducible for you? Thanks! PS: I am |
|
ha, userns-enabled kernel now let's me reproduce me this ... |
Indeed, this works without
|
60d9b45 to
81d12e2
Compare
|
(not that it matters, the commit msg talks about "/sysfs" currently, which should be "/sys" of course...) |
|
I'm reading rkt/rkt#3368
So, do we really need this feature at all? Who will use it? |
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.
81d12e2 to
8e391ad
Compare
Just speaking for rkt (as I don't know if other people have the same needs): nspawn-based stage1 will use it. kvm-based stage1 doesn't need it. An hypothetical runc-based stage1 will not need it. But:
TLDR: rkt may consider a runc-based stage1 at some point, which would come with its own set of pros and cons, quite a bit work, and still need initial investigation. nspawn-based stage1 is what we consider stable and we keep working to make it a bit more flexible. This PR goes into that direction, solves a current issue we have, and may (or may not) be useful for anybody else using an externally prepared net-ns. (sorry for stealing this PR for rkt roadmapping :) |
|
@lucab , thanks for the explanation!
Indeed, makes sense. And my last question :-) |
This follows what was recommended for the other clone-related flags here: |
|
Well, indeed,
But |
|
@evverx well, it's not crazy, but it's also not the kind of stuff I want to push people towards, and make fully supported with guarantees to stay around forever in exactly the current form... But you are right, we should have some kind of documentation for all the env vars we have collected now. But I am not sure it should be a proper user-facing man page, but more some text file (or markdown file) in the git tree, that lists all the special env vars we added over the time, and explains them briefly, but also makes clear that these aren't really primary use controls, but mostly exist with reduced stability gurantees and primarily for debug purposes, and for specific purposes of external software, but not so much for humans... Both nspawn and systemctl collected quite a few env vars over the times, so I figure a doc would be appropriate indeed. |
@s-urbaniak , many thanks! |
systemd v232 introduced an hardcoded protection for /proc/sysrq-trigger, so rkt insecure mode does not work anymore for that path. Disabling the insecure part of TestPathsWrite for now, can be re-enabled once systemd/systemd#4395 is available in stage1.
We don't print all journal files. This is misleading a bit: systemd/systemd#4331 (comment) systemd/systemd#4395 (comment)
We don't print all journal files. This is misleading a bit: systemd/systemd#4331 (comment) systemd/systemd#4395 (comment)
Imported using git-ubuntu import. Changelog parent: 018d813 New changelog entries: [ Martin Pitt ] * debian/tests/unit-config: Query pkg-config for system unit dir. This fixes confusion on merged-/usr systems where both /usr/lib/systemd and /lib/systemd exist. It's actually useful to verify that systemd.pc says the truth. * debian/tests/upstream: Fix clobbering of merged-/usr symlinks * debian/tests/systemd-fsckd: Create /etc/default/grub.d if necessary * debian/rules: Drop check for linking to libs in /usr. This was just an approximation, as booting without an initrd could still be broken by library updates (e. g. #828991). With merged /usr now being the default this is now completely moot. * Move kernel-install initrd script to a later prefix. 60- does not leave much room for scripts that want to run before initrd building (which is usually one of the latest things to do), so bump to 85. Thanks to Sjoerd Simons for the suggestion. * Disable 99-default.link instead of the udev rule for disabling persistent interface names. Disabling 80-net-setup-link.rules will also cause ID_NET_DRIVER to not be set any more, which breaks 80-container-ve.network and matching on driver name in general. So disable the actual default link policy instead. Still keep testing for 80-net-setup-link.rules in the upgrade fix and 73-usb-net-by-mac.rules to keep the desired behaviour on systems which already disabled ifnames via that udev rule. See https://lists.freedesktop.org/archives/systemd-devel/2016-November/037805.html * debian/tests/boot-and-services: Always run seccomp test seccomp is now available on all architectures on which Debian and Ubuntu run tests, so stop making this test silently skip if seccomp is disabled. * Bump libseccomp build dependency as per configure.ac. * Replace "Drop RestrictAddressFamilies=" patch with sed call. With that it will also apply to upstream builds/CI, and it is structurally simpler. * Rebuild against libseccomp with fixed shlibs. (Closes: #844497) [ Michael Biebl ] * fstab-generator: add x-systemd.mount-timeout option. (Closes: #843989) * build-sys: do not install ctrl-alt-del.target symlink twice. (Closes: #844039) * Enable lz4 support. While the compression rate is not as good as XZ, it is much faster, so a better default for the journal and especially systemd-coredump. (Closes: #832010) [ Felipe Sateler ] * Enable machines.target by default. (Closes: #806787) [ Evgeny Vereshchagin ] * debian/tests/upstream: Print all journal files. We don't print all journal files. This is misleading a bit: systemd#4331 (comment) systemd#4395 (comment) [ Luca Boccassi ] * Use mount --move in initramfs-tools udev script. Due to recent changes in busybox and initramfs-tools the mount utility is no longer the one from busybox but from util-linux. The latter does not support mount -o move. The former supports both -o move and --move, so use it instead to be compatible with both. See this discussion for more details: https://bugs.debian.org/823856 (Closes: #844775)
This adds override env vars to leave /sys, and /proc read-write using
SYSTEMD_NSPAWN_MOUNT_RW=true. If unset or set to false, the current behavior
is preserved.
It also introduces SYSTEMD_NSPAWN_USE_NETNS which makes systemd-nspawn behave as if
--private-networkwas specified.This is useful in certain use-cases, where systemd-nspawn is used in
an external network namespace leaving /proc/sys/net read-write.
This is also a follow-up on the discussion from
#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
Related (amongst many others): rkt/rkt#3245