nspawn: R/W support for /sysfs, /proc, and /proc/sys/net by s-urbaniak · Pull Request #4395 · systemd/systemd

s-urbaniak · 2016-10-17T12:06:29Z

This adds override env vars to leave /sys, and /proc read-write using
SYSTEMD_NSPAWN_MOUNT_RW=true. If unset or set to false, the current behavior
is preserved.

It also introduces SYSTEMD_NSPAWN_USE_NETNS which makes systemd-nspawn behave as if
--private-network was specified.

This is useful in certain use-cases, where systemd-nspawn is used in
an external network namespace leaving /proc/sys/net read-write.

This is also a follow-up on the discussion from
#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.

Related (amongst many others): rkt/rkt#3245

s-urbaniak · 2016-10-17T12:07:35Z

src/nspawn/nspawn-mount.c

+                { NULL,                  "/sys",                NULL,    NULL,        MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_RO },
+                { "tmpfs",               "/dev",                "tmpfs", "mode=755",  MS_NOSUID|MS_STRICTATIME,                                  MOUNT_FATAL },
+                { "tmpfs",               "/dev/shm",            "tmpfs", "mode=1777", MS_NOSUID|MS_NODEV|MS_STRICTATIME,                         MOUNT_FATAL },
+                { "tmpfs",               "/run",                "tmpfs", "mode=755",  MS_NOSUID|MS_NODEV|MS_STRICTATIME,                         MOUNT_FATAL },


This mount table looked, and looks pretty complicated, but I didn't have a better idea on how to refactor this to be more readable.

s-urbaniak · 2016-10-17T12:09:21Z

/cc @lucab

poettering

looks ok conceptually, just a few more chnages please!

poettering · 2016-10-17T12:20:00Z

src/nspawn/nspawn-mount.h

+typedef enum MountSettingsMask {
+        MOUNT_FATAL             = 1 << 0,
+        MOUNT_IN_USERNS         = 1 << 1,
+        MOUNT_USE_USERNS        = 1 << 3,


if we have these nice flags, I think we should also cease the chance and document them in a nice way, i.e. add some /* */ explanation here to the end of relevant flags. I figure most flags are pretty much self-explanatory, but I think the namespacing-related options could really use a brief explanation, i.e. what the difference between "IN_USERNS" and "USE_USERNS" actually is...

poettering · 2016-10-17T12:21:43Z

src/nspawn/nspawn-mount.h

+        MOUNT_IN_USERNS         = 1 << 1,
+        MOUNT_USE_USERNS        = 1 << 3,
+        MOUNT_USE_NETNS         = 1 << 2,
+        MOUNT_RO                = 1 << 4,


Hmm, the RO flag is a bit misleadingly named, no? after all it doesn't control read-only vs. writable. Instead it is simply a mask for the actual RO bit used. As such, it should probably named MASK_RO instead of just RO, no?

(or did I misunderstand this?)

Semantically it behaves like MOUNT_USE_NETNS omitting those mounts if unset. So how about if we call both MOUNT_MASK_NETNS, and MOUNT_MASK_RO?

poettering · 2016-10-17T12:23:16Z

src/nspawn/nspawn.c

        parse_share_ns_env("SYSTEMD_NSPAWN_SHARE_SYSTEM", CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS);

+        if (getenv_bool("SYSTEMD_NSPAWN_MOUNT_RW") <= 0)
+                mount_settings |= MOUNT_RO;


I'd probably turn this around, and set mount_settings to MOUNT_RO when declaring the variable (instead of setting it to 0 then), and then remove the bit, if the env var here is set. I think it's a good idea to have the static arg_xyz variables all initialized to their actual defaults by default, in order to make things more obvious to readers.

poettering · 2016-10-17T12:24:23Z

src/nspawn/nspawn.c

 static bool arg_notify_ready = false;
 static bool arg_use_cgns = true;
 static unsigned long arg_clone_ns_flags = CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS;
+static MountSettingsMask mount_settings = 0;


These global, static variables that directly relate to cmdline arguments (or in some case, including this one, to env vars), should all be prefixed with the "arg_" prefix, to indicate that their are process arguments instead of internal state. We followed that pretty strictly so far in our tree, and should do so here, I think

poettering · 2016-10-17T12:26:37Z

src/nspawn/nspawn-mount.c

+                bool in_userns = (mount_table[k].mount_settings & MOUNT_IN_USERNS) > 0;
+                bool use_userns = (mount_table[k].mount_settings & MOUNT_USE_USERNS) > 0;
+
+                if (((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS) > 0)


why the > 0 check here? you can just drop that and rely on C's downgrade-to-bool feature here. After all, this is actually a bool-like expression that you are using here, hence using it as a bool if you can makes a ton of sense.

poettering · 2016-10-17T12:27:21Z

src/nspawn/nspawn-mount.c

                o = mount_table[k].options;
                if (streq_ptr(mount_table[k].type, "tmpfs")) {
-                        if (in_userns)
+                        if (in_userns > 0)


in_userns is a bool, it's strange doing a greater-than comparison with bools...

poettering · 2016-10-17T12:29:49Z

src/nspawn/nspawn-mount.c

+                remount_flags |= MS_RDONLY;
+        }
+
+        r = mount_verbose(LOG_ERR, "sysfs", full, "sysfs", sysfs_flags, NULL);


I'd really do this the other way round: leave the literal flags parameter being passed to mount_verbose() in place, but add an |extra to the ORed flags as last step. Then, simply define extra as long, and initialize it based on the MOUNT_RO flag, and reuse it for both cases. Such a change would be much shorter, and only needs one extra var, instead of two...

poettering · 2016-10-17T12:32:36Z

src/nspawn/nspawn.c

+                mount_settings |= MOUNT_USE_NETNS;
+
+        if (getenv_bool("SYSTEMD_NSPAWN_USE_NETNS") <= 0)
+                mount_settings |= MOUNT_USE_NETNS;


not sure I get this one. What's the precise usecase here? Why wouldn't one use --private-network here?

--private-network creates its own network namespace, which we would like to omit here. The use case is rkt+an external network configurator (e.g. CNI), where we already have an external network namespace, join it, and have /proc/sys/net read-write.

so, what precisely is left of --private-network's functionality if you use $SYSTEMD_NSPAWN_USE_NETNS? Only the fact that /proc/sys/net remains writable?

So, summarizing your patch set with $SYSTEMD_NSPAWN_MOUNT_RW you get all of /proc/, /sys writable, but with $SYSTEMD_NSPAWN_USE_NETNS you get only /proc/sys/net writable but everything else still read-only? If so, the latter option is a true subset of the former? If so, maybe this should not be two bool env vars, but instead one that takes an enum? Maybe $SYSTEMD_API_VFS_WRITABLE=auto|yes|no|network or so?

SYSTEMD_API_VFS_WRITABLE sounds good to me, I'll head this route.

s-urbaniak · 2016-10-17T12:49:45Z

@poettering thanks for the swift review! I'll address the comments for the next round (and apparently something broke my patch in the last 4 days)

evverx · 2016-10-18T05:07:47Z

Xenial

+ env UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.uJpT5e/nspawn-root /lib/systemd/systemd root=/dev/sda1 raid=noautodetect loglevel=2 init=/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no
Spawning container nspawn-root on /var/tmp/systemd-test.uJpT5e/nspawn-root.
Press ^] three times within 1s to kill container.
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): No such file or directory
Child died too early.

CentOS

+ env UNIFIED_CGROUP_HIERARCHY=no ../../systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.1XpIOE/nspawn-root /usr/lib/systemd/systemd
Spawning container nspawn-root on /var/tmp/systemd-test.1XpIOE/nspawn-root.
Press ^] three times within 1s to kill container.
Failed to mount n/a on /var/tmp/systemd-test.1XpIOE/nspawn-root/etc/resolv.conf (MS_BIND ""): No such file or directory
Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory

BTW: there is a smoke test for systemd-nspawn: #4378
I think we should merge #4378 first.

SYSTEMD_API_VFS_WRITABLE=auto|yes|no|network

I think we should add SYSTEMD_NSPAWN_API_VFS_WRITABLE to the smoke test too

s-urbaniak · 2016-10-18T10:05:09Z

@evverx Thanks for the test results, I made indeed a mistake enabling USE_NETNS by default (https://github.com/systemd/systemd/pull/4395/files#diff-db6b16c9ade7e951e3bb793bec110098R1083) which I already fixed locally.

I am fine with #4378 merging first, and will try to add a smoke test for SYSTEMD_NSPAWN_API_VFS_WRITABLE too.

evverx · 2016-10-18T10:30:38Z

@s-urbaniak

will try to add a smoke test for SYSTEMD_NSPAWN_API_VFS_WRITABLE too.

Feel free to ping me

(but, actually, I don't know how to pass that smoke-test: #4372 (comment) :-))

poettering

some more comments.

poettering · 2016-10-18T15:15:20Z

src/nspawn/nspawn.c

+        else if (streq(e, "no"))
+                arg_mount_settings |= MOUNT_MASK_RO;
+        else if (streq(e, "network"))
+                arg_mount_settings |= MOUNT_MASK_NETNS;


So, I think we should use the "extended bool" logic to parse this, which we frequently use in other cases like this. Specifically, this means checking first if the parameter has the "network" value. And if not, then fall back to parsing it as a boolean via parse_boolean(). That way, people can treat it as a usual boolean (and also use "0", "off" and "false" as negative value, and "1", "on", "true" as positive value), except that it understands one more, special value, which is "network".

I also think this should generate a warning message, when we don't understand the passed parameter. i.e. use log_warning() and say that the value set in the env var is not understood and will be ignored.

poettering · 2016-10-18T15:16:09Z

src/nspawn/nspawn.c

+                return;
+
+        if (streq(e, "yes"))
+                arg_mount_settings &= ~MOUNT_MASK_RO;


this should not only unset MOUNT_MASK_RO, but also MOUNT_MASK_NETNS

yes, indeed, that's a good idea.

poettering · 2016-10-18T15:16:28Z

src/nspawn/nspawn.c

+        if (streq(e, "yes"))
+                arg_mount_settings &= ~MOUNT_MASK_RO;
+        else if (streq(e, "no"))
+                arg_mount_settings |= MOUNT_MASK_RO;


this should not only set MOUNT_MASK_RO, but also unset MOUNT_MASK_NETNS, no?

poettering · 2016-10-18T15:27:34Z

src/nspawn/nspawn-mount.h

+        MOUNT_USE_USERNS        = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
+        MOUNT_IN_USERNS         = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
+        MOUNT_MASK_NETNS        = 1 << 3, /* masks mounts to be executed in a network namespace in addition to all other mounts */
+        MOUNT_MASK_RO           = 1 << 4, /* masks mounts to be executed read-only in addition to all other mounts */


i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.

If i got this right, the MASK_RO set means "/proc/sys is read-only", correct? And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?

Thinking about this I have the suspicion that "mask" is actually a weird word here and we really shouldn't use it. After all, we use these flags both for the "mask" that is set in the env var, but also for the flags in the mount table. And one is a mask, the other a condition i would say...

Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.

sorry for the confusoin around this...

i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.

Yes, they are turning features "on", I'll enhance the comments.

If i got this right, the MASK_RO set means "/proc/sys is read-only", correct?

yes

And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?

no, that means "/proc/sys/net" will remain read-write.

Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.

agreed

sorry for the confusoin around this...

No worries :-)

evverx · 2016-10-19T09:32:52Z

systemd-nspawn -D ./cont -U -b
...
Selected user namespace base 1605697536 and range 65536.
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to write boot id: Permission denied

Seems like the outer child doesn't patch tmpfs options:

2599  mount("tmpfs", "/home/vagrant/systemd/./cont/run", "tmpfs", MS_NOSUID|MS_NODEV|MS_STRICTATIME, "mode=755") = 0

Should be something like

2890  mount("tmpfs", "/home/vagrant/systemd/./cont/run", "tmpfs", MS_NOSUID|MS_NODEV|MS_STRICTATIME, "mode=755,uid=1605697536,gid=1605697536") = 0

s-urbaniak · 2016-10-19T11:23:12Z

@evverx gah, thanks, already spotted my mistake.

keszybz · 2016-10-20T15:21:42Z

src/nspawn/nspawn-mount.c

+int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
        const char *full, *top, *x;
        int r;
+        long extra_flags = 0;


unsigned long

keszybz · 2016-10-20T15:23:30Z

src/nspawn/nspawn-mount.c

+                { NULL,                  "/proc/sys",           NULL,    NULL,        MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO },                          /* ... then, make it r/o */
+                { "/proc/sysrq-trigger", "/proc/sysrq-trigger", NULL,    NULL,        MS_BIND,                                                               MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO },                          /* Bind mount first ...*/
+                { NULL,                  "/proc/sysrq-trigger", NULL,    NULL,        MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT,             MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO },                          /* ... then, make it r/o */
+                { "tmpfs",               "/tmp",                "tmpfs", "mode=1777", MS_STRICTATIME,                                            MOUNT_FATAL|MOUNT_IN_USERNS },


Thanks, this table is finally legible.

keszybz · 2016-10-20T15:29:35Z

src/nspawn/nspawn-mount.c

                const char *o;
+                bool fatal = (mount_table[k].mount_settings & MOUNT_FATAL);
+
+                if ((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS)


Dunno, I don't find those bitmask manipulations very clear. I think it would be easier to read if this was written as
if ((mount_settings & MOUNT_IN_USERNS) != (mount_table[k].mount_settings & MOUNT_IN_USERNS))

keszybz · 2016-10-20T15:34:22Z

src/nspawn/nspawn-mount.c


-                if (!use_netns && mount_table[k].use_netns)
+                if (~mount_settings & mount_table[k].mount_settings & MOUNT_APPLY_APIVFS_RO)
                        continue;


A helper variable would make the intent clear:

bool leftover_flags = ~mount_settings & mount_table[k].mount_settings; if (leftover_flags & MOUNT_APPLY_APIVFS_NETNS || leftover_flags & MOUNT_APPLY_APIVFS_RO) continue;

keszybz · 2016-10-20T15:34:47Z

src/nspawn/nspawn-mount.h

 #include "cgroup-util.h"

+typedef enum MountSettingsMask {
+        MOUNT_FATAL              = 1 << 0, /* if set, a mount error are considered fatal */


keszybz · 2016-10-20T15:35:20Z

src/nspawn/nspawn-mount.h

+        MOUNT_FATAL              = 1 << 0, /* if set, a mount error are considered fatal */
+        MOUNT_USE_USERNS         = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
+        MOUNT_IN_USERNS          = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
+        MOUNT_APPLY_APIVFS_RO    = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */


keszybz · 2016-10-20T15:35:42Z

src/nspawn/nspawn-mount.h

+        MOUNT_USE_USERNS         = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
+        MOUNT_IN_USERNS          = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
+        MOUNT_APPLY_APIVFS_RO    = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */
+        MOUNT_APPLY_APIVFS_NETNS = 1 << 4, /* if set, /proc/sys/net will be mounted read-write. Works only if MOUNT_APPLY_APIVFS_NETNS is also set. */


Wrap the comment?

keszybz · 2016-10-20T15:36:13Z

src/nspawn/nspawn.c

        arg_clone_ns_flags = (arg_clone_ns_flags & ~ns_flag) | (r > 0 ? 0 : ns_flag);
 }

+


Spurious empty line.

keszybz · 2016-10-20T15:38:53Z

src/nspawn/nspawn.c

+                return;
+        }
+
+        r = getenv_bool("SYSTEMD_NSPAWN_API_VFS_WRITABLE");


It's a minor thing, but querying the envvar twice is ugly. Use

r = parse_boolean(e);

instead.

keszybz · 2016-10-20T15:40:37Z

src/nspawn/nspawn.c

+                arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS;
+        } else {
+                arg_mount_settings |= MOUNT_APPLY_APIVFS_RO;
+                arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS;


Is this intentational that lines 403 and 406 are the same? If yes, it should be moved below the else block.

s-urbaniak · 2016-10-20T15:58:36Z

@keszybz thanks for the review! I'll address the comments in the next round.

evverx · 2016-10-21T13:47:09Z

@s-urbaniak , I think some combinations doesn't make sense, right?

+ UNIFIED_CGROUP_HIERARCHY=no
+ SYSTEMD_NSPAWN_USE_CGNS=no
+ SYSTEMD_NSPAWN_API_VFS_WRITABLE=network
+ systemd-nspawn --register=no -D /var/lib/machines/unified-no-cgns-no-apivfs-network -U -b
Spawning container unified-no-cgns-no-apivfs-network on /var/lib/machines/unified-no-cgns-no-apivfs-network.
...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC "")...
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): Operation not permitted
Failed to chown() cgroup file cgroup.events, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.controllers, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.subtree_control, ignoring: No such file or directory
Umounting /tmp/unifiedQhn3DL...
Failed to chown() cgroup file tasks, ignoring: No such file or directory
Failed to chown() cgroup file notify_on_release, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.clone_children, ignoring: No such file or directory
Child died too early.

Shouldn't we call the verify_arguments or so?

s-urbaniak · 2016-10-25T07:22:46Z

@evverx hmm, I need to have a more detailed look into this (, but won't make until the beginning of the next week.)

s-urbaniak · 2016-11-02T15:33:21Z

@evverx hmm, I can't reproduce this, I am recompiling my kernel now with user-namespace support, but at least without -U:

sudo UNIFIED_CGROUP_HIERARCHY=no SYSTEMD_NSPAWN_USE_CGNS=no SYSTEMD_NSPAWN_API_VFS_WRITABLE=network ~/src/systemd/systemd-nspawn -D ~/tmp/debian-sid

works for me. I am apparently missing something, but I don't see a connection between write-only /proc/sys/net, and the above settings (non-unified cgroup hierarchy, no cgroup namespace)?!

Can you provide me with a little more details on a setup that is reproducible for you?

Thanks!

PS: I am sur on freenode, also in #rkt, and #rkt-dev.

s-urbaniak · 2016-11-02T16:29:48Z

ha, userns-enabled kernel now let's me reproduce me this ...

evverx · 2016-11-02T16:30:10Z

@s-urbaniak

but at least without -U ...
works for me

Indeed, this works without -U. I think -U breaks something.

Can you provide me with a little more details on a setup that is reproducible for you?

-bash-4.3# cat /proc/version
Linux version 4.7.9-200.fc24.x86_64 ([email protected]) (gcc version 6.2.1 20160916 (Red Hat 6.2.1-2) (GCC) ) #1 SMP Thu Oct 20 14:26:16 UTC 2016

-bash-4.3# UNIFIED_CGROUP_HIERARCHY=no SYSTEMD_NSPAWN_USE_CGNS=no SYSTEMD_NSPAWN_API_VFS_WRITABLE=network systemd-nspawn --register=no -D ./cont/ -U -b
Spawning container cont on /root/./cont.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1374814208 and range 65536.
Failed to mount n/a on /root/./cont/sys/fs/selinux (MS_BIND ""): No such file or directory
Failed to mount n/a on /root/./cont/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND ""): Invalid argument
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): Operation not permitted

# inner child
407   clone(child_stack=NULL, flags=CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|SIGCHLD) = 439
...
# outer child
439   mkdir("/proc/sysrq-trigger", 0755) = -1 EEXIST (File exists)
439   stat("/proc/sysrq-trigger", {st_dev=makedev(0, 46), st_ino=4026532058, st_mode=S_IFREG|0200, st_nlink=1, st_uid=65534, st_gid=65534, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2016/11/02-07:0
439   mount(NULL, "/proc/sysrq-trigger", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0
439   lstat("/tmp", 0x7ffca702fb20)     = -1 ENOENT (No such file or directory)
439   mkdir("/tmp", 0755)               = 0
439   mount("tmpfs", "/tmp", "tmpfs", MS_STRICTATIME, "mode=1777,uid=0,gid=0") = 0
439   open("/sys", O_RDONLY)            = 9</sys>
439   fstatfs(9</sys>, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=62592, f_bfree=62592, f_bavail=62592, f_files=62592, f_ffree=62589, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSU
439   close(9</sys>)                    = 0
439   mkdir("/sys/full", 0755)          = 0
439   mount("sysfs", "/sys/full", "sysfs", MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
439   writev(2</0>, [{iov_base="Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC \"\"): Operation not permitted", iov_len=103}, {iov_base="\n", iov_len=1}], 2) = 104
439   exit_group(1)                     = ?

systemd#4395

poettering · 2016-11-17T17:23:18Z

(not that it matters, the commit msg talks about "/sysfs" currently, which should be "/sys" of course...)

evverx · 2016-11-17T17:37:25Z

I'm reading rkt/rkt#3368

Presumably ditch nspawn eventually
...
Transitioning from nspawn to runc for stage1-spawning will help supporting some more cases

So, do we really need this feature at all? Who will use it?
@poettering , @s-urbaniak , @lucab ?

This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from systemd#4018 (comment) where an introduction of a simple env var to enable R/W support for those directories was already discussed.

systemd#4395

lucab · 2016-11-18T10:02:43Z

So, do we really need this feature at all? Who will use it?

Just speaking for rkt (as I don't know if other people have the same needs): nspawn-based stage1 will use it. kvm-based stage1 doesn't need it. An hypothetical runc-based stage1 will not need it. But:

we only have nspawn&kvm stage1 at the moment, and no direct plan for a runc-based stage1 right now. See Investigate integrating Runc rkt/rkt#3327 (comment) for a better explanation.
the comment you quoted is related to a broader discussion about incompatible pod-level and app-level features in rkt. @euank suggested that using runc instead of nspawn would solve it, and I replied that it would solve something, but the general problem still exists.

TLDR: rkt may consider a runc-based stage1 at some point, which would come with its own set of pros and cons, quite a bit work, and still need initial investigation. nspawn-based stage1 is what we consider stable and we keep working to make it a bit more flexible. This PR goes into that direction, solves a current issue we have, and may (or may not) be useful for anybody else using an externally prepared net-ns.

(sorry for stealing this PR for rkt roadmapping :)

evverx · 2016-11-18T10:40:28Z

@lucab , thanks for the explanation!

may be useful for anybody else using an externally prepared net-ns

Indeed, makes sense.

And my last question :-)
Why not to document SYSTEMD_NSPAWN_API_VFS_WRITABLE?

lucab · 2016-11-18T10:44:10Z

Why not to document SYSTEMD_NSPAWN_API_VFS_WRITABLE?

This follows what was recommended for the other clone-related flags here:
#2982 (comment).

evverx · 2016-11-18T10:47:39Z

Well, indeed, --share-system and friends

are pretty crazy, and we should not have them

But SYSTEMD_NSPAWN_API_VFS_WRITABLE doesn't look crazy (to me at least)

poettering · 2016-11-18T11:39:16Z

@evverx well, it's not crazy, but it's also not the kind of stuff I want to push people towards, and make fully supported with guarantees to stay around forever in exactly the current form...

But you are right, we should have some kind of documentation for all the env vars we have collected now. But I am not sure it should be a proper user-facing man page, but more some text file (or markdown file) in the git tree, that lists all the special env vars we added over the time, and explains them briefly, but also makes clear that these aren't really primary use controls, but mostly exist with reduced stability gurantees and primarily for debug purposes, and for specific purposes of external software, but not so much for humans... Both nspawn and systemctl collected quite a few env vars over the times, so I figure a doc would be appropriate indeed.

evverx · 2016-11-18T14:32:57Z

But I am not sure it should be a proper user-facing man page, but more some text file (or markdown file) in the git tree, that lists all the special env vars we added over the time

git log -p :-)

@s-urbaniak , many thanks!

systemd v232 introduced an hardcoded protection for /proc/sysrq-trigger, so rkt insecure mode does not work anymore for that path. Disabling the insecure part of TestPathsWrite for now, can be re-enabled once systemd/systemd#4395 is available in stage1.

We don't print all journal files. This is misleading a bit: systemd/systemd#4331 (comment) systemd/systemd#4395 (comment)

Imported using git-ubuntu import. Changelog parent: 018d813 New changelog entries: [ Martin Pitt ] * debian/tests/unit-config: Query pkg-config for system unit dir. This fixes confusion on merged-/usr systems where both /usr/lib/systemd and /lib/systemd exist. It's actually useful to verify that systemd.pc says the truth. * debian/tests/upstream: Fix clobbering of merged-/usr symlinks * debian/tests/systemd-fsckd: Create /etc/default/grub.d if necessary * debian/rules: Drop check for linking to libs in /usr. This was just an approximation, as booting without an initrd could still be broken by library updates (e. g. #828991). With merged /usr now being the default this is now completely moot. * Move kernel-install initrd script to a later prefix. 60- does not leave much room for scripts that want to run before initrd building (which is usually one of the latest things to do), so bump to 85. Thanks to Sjoerd Simons for the suggestion. * Disable 99-default.link instead of the udev rule for disabling persistent interface names. Disabling 80-net-setup-link.rules will also cause ID_NET_DRIVER to not be set any more, which breaks 80-container-ve.network and matching on driver name in general. So disable the actual default link policy instead. Still keep testing for 80-net-setup-link.rules in the upgrade fix and 73-usb-net-by-mac.rules to keep the desired behaviour on systems which already disabled ifnames via that udev rule. See https://lists.freedesktop.org/archives/systemd-devel/2016-November/037805.html * debian/tests/boot-and-services: Always run seccomp test seccomp is now available on all architectures on which Debian and Ubuntu run tests, so stop making this test silently skip if seccomp is disabled. * Bump libseccomp build dependency as per configure.ac. * Replace "Drop RestrictAddressFamilies=" patch with sed call. With that it will also apply to upstream builds/CI, and it is structurally simpler. * Rebuild against libseccomp with fixed shlibs. (Closes: #844497) [ Michael Biebl ] * fstab-generator: add x-systemd.mount-timeout option. (Closes: #843989) * build-sys: do not install ctrl-alt-del.target symlink twice. (Closes: #844039) * Enable lz4 support. While the compression rate is not as good as XZ, it is much faster, so a better default for the journal and especially systemd-coredump. (Closes: #832010) [ Felipe Sateler ] * Enable machines.target by default. (Closes: #806787) [ Evgeny Vereshchagin ] * debian/tests/upstream: Print all journal files. We don't print all journal files. This is misleading a bit: systemd#4331 (comment) systemd#4395 (comment) [ Luca Boccassi ] * Use mount --move in initramfs-tools udev script. Due to recent changes in busybox and initramfs-tools the mount utility is no longer the one from busybox but from util-linux. The latter does not support mount -o move. The former supports both -o move and --move, so use it instead to be compatible with both. See this discussion for more details: https://bugs.debian.org/823856 (Closes: #844775)

s-urbaniak commented Oct 17, 2016

View reviewed changes

s-urbaniak mentioned this pull request Oct 17, 2016

--insecure-options=all-run should mount /proc/sys read-write rkt/rkt#3245

Open

poettering requested changes Oct 17, 2016

View reviewed changes

poettering added the nspawn label Oct 17, 2016

poettering reviewed Oct 17, 2016

View reviewed changes

evverx added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Oct 18, 2016

evverx mentioned this pull request Oct 18, 2016

tests: add smoke test for systemd-nspawn #4378

Merged

s-urbaniak force-pushed the rw-support branch 2 times, most recently from 398f610 to 3dd3bc9 Compare October 18, 2016 13:30

poettering reviewed Oct 18, 2016

View reviewed changes

s-urbaniak mentioned this pull request Oct 20, 2016

nspawn: Regression in user namespacing #4421

Closed

2 tasks

s-urbaniak force-pushed the rw-support branch from 3dd3bc9 to d824fdd Compare October 20, 2016 13:47

keszybz requested changes Oct 20, 2016

View reviewed changes

keszybz added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Oct 20, 2016

s-urbaniak force-pushed the rw-support branch from d824fdd to c963254 Compare October 21, 2016 10:13

lucab mentioned this pull request Oct 24, 2016

stage1-coreos: new systemd (v232-pre) fails on TestSupplementaryGIDs rkt/rkt#3262

Closed

s-urbaniak force-pushed the rw-support branch from 60d9b45 to 81d12e2 Compare November 15, 2016 10:49

s-urbaniak pushed a commit to s-urbaniak/systemd that referenced this pull request Nov 15, 2016

tests/TEST-13-NSPAWN-SMOKE: add SYSTEMD_NSPAWN_API_VFS_WRITABLE

81d12e2

systemd#4395

lucab mentioned this pull request Nov 17, 2016

OCI: preparing for runtime-spec support at app-level rkt/rkt#3368

Open

3 tasks

s-urbaniak and others added 2 commits November 18, 2016 09:50

tests/TEST-13-NSPAWN-SMOKE: add SYSTEMD_NSPAWN_API_VFS_WRITABLE

8e391ad

systemd#4395

s-urbaniak force-pushed the rw-support branch from 81d12e2 to 8e391ad Compare November 18, 2016 08:51

evverx mentioned this pull request Nov 18, 2016

run tests against Ubuntu development version? #4611

Closed

poettering removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 18, 2016

poettering approved these changes Nov 18, 2016

View reviewed changes

poettering added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Nov 18, 2016

poettering merged commit f4ff4aa into systemd:master Nov 18, 2016

This was referenced Nov 20, 2016

Document (and/or debug) launching rkt in rkt rkt/rkt#2158

Closed

cherry-pick: nspawn: R/W support for /sys, and /proc/sys (v232) coreos/systemd#70

Closed

s-urbaniak mentioned this pull request Nov 25, 2016

cherry-pick: nspawn: R/W support for /sys, and /proc/sys (v231) coreos/systemd#71

Merged

evverx mentioned this pull request Dec 2, 2016

More namespace improvements #4727

Merged

manover pushed a commit to manover/systemd that referenced this pull request Jan 21, 2017

debian/tests/upstream: Print all journal files

9a11f52

We don't print all journal files. This is misleading a bit: systemd/systemd#4331 (comment) systemd/systemd#4395 (comment)

xaiki pushed a commit to endlessm/systemd that referenced this pull request Feb 1, 2017

debian/tests/upstream: Print all journal files

8ddb913

We don't print all journal files. This is misleading a bit: systemd/systemd#4331 (comment) systemd/systemd#4395 (comment)

tixxdz mentioned this pull request Mar 15, 2017

/proc/sys/net/core/somaxconn 128 in the container #5592

Closed

2 tasks

dato mentioned this pull request Nov 23, 2017

Cannot disable private network from nspawn settings file if private users are used #7429

Closed

Spaceless007 mentioned this pull request Jul 21, 2020

nspawn: have a setting to be able to use --bind with /sys/ directories, can't right now since RO #16528

Open

		arg_clone_ns_flags = (arg_clone_ns_flags & ~ns_flag) \| (r > 0 ? 0 : ns_flag);
		}

Uh oh!

Conversation

s-urbaniak commented Oct 17, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

s-urbaniak commented Oct 17, 2016

Uh oh!

poettering left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

s-urbaniak commented Oct 17, 2016

Uh oh!

evverx commented Oct 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

s-urbaniak commented Oct 18, 2016

Uh oh!

evverx commented Oct 18, 2016

Uh oh!

poettering left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

evverx commented Oct 19, 2016

Uh oh!

s-urbaniak commented Oct 19, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

s-urbaniak commented Oct 17, 2016 •

edited

Loading

evverx commented Oct 18, 2016 •

edited

Loading

evverx commented Oct 21, 2016 •

edited

Loading

evverx commented Nov 18, 2016 •

edited

Loading