Skip to content

nspawn: R/W support for /sysfs, /proc, and /proc/sys/net#4395

Merged
poettering merged 2 commits intosystemd:masterfrom
s-urbaniak:rw-support
Nov 18, 2016
Merged

nspawn: R/W support for /sysfs, /proc, and /proc/sys/net#4395
poettering merged 2 commits intosystemd:masterfrom
s-urbaniak:rw-support

Conversation

@s-urbaniak
Copy link
Contributor

@s-urbaniak s-urbaniak commented Oct 17, 2016

This adds override env vars to leave /sys, and /proc read-write using
SYSTEMD_NSPAWN_MOUNT_RW=true. If unset or set to false, the current behavior
is preserved.

It also introduces SYSTEMD_NSPAWN_USE_NETNS which makes systemd-nspawn behave as if
--private-network was specified.

This is useful in certain use-cases, where systemd-nspawn is used in
an external network namespace leaving /proc/sys/net read-write.

This is also a follow-up on the discussion from
#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.

Related (amongst many others): rkt/rkt#3245

{ NULL, "/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_RO },
{ "tmpfs", "/dev", "tmpfs", "mode=755", MS_NOSUID|MS_STRICTATIME, MOUNT_FATAL },
{ "tmpfs", "/dev/shm", "tmpfs", "mode=1777", MS_NOSUID|MS_NODEV|MS_STRICTATIME, MOUNT_FATAL },
{ "tmpfs", "/run", "tmpfs", "mode=755", MS_NOSUID|MS_NODEV|MS_STRICTATIME, MOUNT_FATAL },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mount table looked, and looks pretty complicated, but I didn't have a better idea on how to refactor this to be more readable.

@s-urbaniak
Copy link
Contributor Author

/cc @lucab

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok conceptually, just a few more chnages please!

typedef enum MountSettingsMask {
MOUNT_FATAL = 1 << 0,
MOUNT_IN_USERNS = 1 << 1,
MOUNT_USE_USERNS = 1 << 3,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have these nice flags, I think we should also cease the chance and document them in a nice way, i.e. add some /* */ explanation here to the end of relevant flags. I figure most flags are pretty much self-explanatory, but I think the namespacing-related options could really use a brief explanation, i.e. what the difference between "IN_USERNS" and "USE_USERNS" actually is...

MOUNT_IN_USERNS = 1 << 1,
MOUNT_USE_USERNS = 1 << 3,
MOUNT_USE_NETNS = 1 << 2,
MOUNT_RO = 1 << 4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the RO flag is a bit misleadingly named, no? after all it doesn't control read-only vs. writable. Instead it is simply a mask for the actual RO bit used. As such, it should probably named MASK_RO instead of just RO, no?

(or did I misunderstand this?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantically it behaves like MOUNT_USE_NETNS omitting those mounts if unset. So how about if we call both MOUNT_MASK_NETNS, and MOUNT_MASK_RO?

parse_share_ns_env("SYSTEMD_NSPAWN_SHARE_SYSTEM", CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS);

if (getenv_bool("SYSTEMD_NSPAWN_MOUNT_RW") <= 0)
mount_settings |= MOUNT_RO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably turn this around, and set mount_settings to MOUNT_RO when declaring the variable (instead of setting it to 0 then), and then remove the bit, if the env var here is set. I think it's a good idea to have the static arg_xyz variables all initialized to their actual defaults by default, in order to make things more obvious to readers.

static bool arg_notify_ready = false;
static bool arg_use_cgns = true;
static unsigned long arg_clone_ns_flags = CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWUTS;
static MountSettingsMask mount_settings = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These global, static variables that directly relate to cmdline arguments (or in some case, including this one, to env vars), should all be prefixed with the "arg_" prefix, to indicate that their are process arguments instead of internal state. We followed that pretty strictly so far in our tree, and should do so here, I think

bool in_userns = (mount_table[k].mount_settings & MOUNT_IN_USERNS) > 0;
bool use_userns = (mount_table[k].mount_settings & MOUNT_USE_USERNS) > 0;

if (((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS) > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the > 0 check here? you can just drop that and rely on C's downgrade-to-bool feature here. After all, this is actually a bool-like expression that you are using here, hence using it as a bool if you can makes a ton of sense.

o = mount_table[k].options;
if (streq_ptr(mount_table[k].type, "tmpfs")) {
if (in_userns)
if (in_userns > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in_userns is a bool, it's strange doing a greater-than comparison with bools...

remount_flags |= MS_RDONLY;
}

r = mount_verbose(LOG_ERR, "sysfs", full, "sysfs", sysfs_flags, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really do this the other way round: leave the literal flags parameter being passed to mount_verbose() in place, but add an |extra to the ORed flags as last step. Then, simply define extra as long, and initialize it based on the MOUNT_RO flag, and reuse it for both cases. Such a change would be much shorter, and only needs one extra var, instead of two...

mount_settings |= MOUNT_USE_NETNS;

if (getenv_bool("SYSTEMD_NSPAWN_USE_NETNS") <= 0)
mount_settings |= MOUNT_USE_NETNS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I get this one. What's the precise usecase here? Why wouldn't one use --private-network here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--private-network creates its own network namespace, which we would like to omit here. The use case is rkt+an external network configurator (e.g. CNI), where we already have an external network namespace, join it, and have /proc/sys/net read-write.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, what precisely is left of --private-network's functionality if you use $SYSTEMD_NSPAWN_USE_NETNS? Only the fact that /proc/sys/net remains writable?

So, summarizing your patch set with $SYSTEMD_NSPAWN_MOUNT_RW you get all of /proc/, /sys writable, but with $SYSTEMD_NSPAWN_USE_NETNS you get only /proc/sys/net writable but everything else still read-only? If so, the latter option is a true subset of the former? If so, maybe this should not be two bool env vars, but instead one that takes an enum? Maybe $SYSTEMD_API_VFS_WRITABLE=auto|yes|no|network or so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYSTEMD_API_VFS_WRITABLE sounds good to me, I'll head this route.

@s-urbaniak
Copy link
Contributor Author

@poettering thanks for the swift review! I'll address the comments for the next round (and apparently something broke my patch in the last 4 days)

@evverx
Copy link
Contributor

evverx commented Oct 18, 2016

Xenial

+ env UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.uJpT5e/nspawn-root /lib/systemd/systemd root=/dev/sda1 raid=noautodetect loglevel=2 init=/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no
Spawning container nspawn-root on /var/tmp/systemd-test.uJpT5e/nspawn-root.
Press ^] three times within 1s to kill container.
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): No such file or directory
Child died too early.

CentOS

+ env UNIFIED_CGROUP_HIERARCHY=no ../../systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.1XpIOE/nspawn-root /usr/lib/systemd/systemd
Spawning container nspawn-root on /var/tmp/systemd-test.1XpIOE/nspawn-root.
Press ^] three times within 1s to kill container.
Failed to mount n/a on /var/tmp/systemd-test.1XpIOE/nspawn-root/etc/resolv.conf (MS_BIND ""): No such file or directory
Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory

BTW: there is a smoke test for systemd-nspawn: #4378
I think we should merge #4378 first.

SYSTEMD_API_VFS_WRITABLE=auto|yes|no|network

I think we should add SYSTEMD_NSPAWN_API_VFS_WRITABLE to the smoke test too

@evverx evverx added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Oct 18, 2016
@s-urbaniak
Copy link
Contributor Author

@evverx Thanks for the test results, I made indeed a mistake enabling USE_NETNS by default (https://github.com/systemd/systemd/pull/4395/files#diff-db6b16c9ade7e951e3bb793bec110098R1083) which I already fixed locally.

I am fine with #4378 merging first, and will try to add a smoke test for SYSTEMD_NSPAWN_API_VFS_WRITABLE too.

@evverx
Copy link
Contributor

evverx commented Oct 18, 2016

@s-urbaniak

will try to add a smoke test for SYSTEMD_NSPAWN_API_VFS_WRITABLE too.

Feel free to ping me

(but, actually, I don't know how to pass that smoke-test: #4372 (comment) :-))

@s-urbaniak s-urbaniak force-pushed the rw-support branch 2 times, most recently from 398f610 to 3dd3bc9 Compare October 18, 2016 13:30
Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments.

else if (streq(e, "no"))
arg_mount_settings |= MOUNT_MASK_RO;
else if (streq(e, "network"))
arg_mount_settings |= MOUNT_MASK_NETNS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think we should use the "extended bool" logic to parse this, which we frequently use in other cases like this. Specifically, this means checking first if the parameter has the "network" value. And if not, then fall back to parsing it as a boolean via parse_boolean(). That way, people can treat it as a usual boolean (and also use "0", "off" and "false" as negative value, and "1", "on", "true" as positive value), except that it understands one more, special value, which is "network".

I also think this should generate a warning message, when we don't understand the passed parameter. i.e. use log_warning() and say that the value set in the env var is not understood and will be ignored.

return;

if (streq(e, "yes"))
arg_mount_settings &= ~MOUNT_MASK_RO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not only unset MOUNT_MASK_RO, but also MOUNT_MASK_NETNS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, indeed, that's a good idea.

if (streq(e, "yes"))
arg_mount_settings &= ~MOUNT_MASK_RO;
else if (streq(e, "no"))
arg_mount_settings |= MOUNT_MASK_RO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not only set MOUNT_MASK_RO, but also unset MOUNT_MASK_NETNS, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack :-)

MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
MOUNT_MASK_NETNS = 1 << 3, /* masks mounts to be executed in a network namespace in addition to all other mounts */
MOUNT_MASK_RO = 1 << 4, /* masks mounts to be executed read-only in addition to all other mounts */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.

If i got this right, the MASK_RO set means "/proc/sys is read-only", correct? And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?

Thinking about this I have the suspicion that "mask" is actually a weird word here and we really shouldn't use it. After all, we use these flags both for the "mask" that is set in the env var, but also for the flags in the mount table. And one is a mask, the other a condition i would say...

Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.

sorry for the confusoin around this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am a bit confused by the latter two. do these flags now turn the relevant mount operations off? or do they turn them on? the comments need to be clearer about this.

Yes, they are turning features "on", I'll enhance the comments.

If i got this right, the MASK_RO set means "/proc/sys is read-only", correct?

yes

And if MASK_NETNS is set that means "/proc/sys/net" is read-only, right?

no, that means "/proc/sys/net" will remain read-write.

Maybe we should call them "MOUNT_APPLY_APIVFS_RO" and "MOUNT_APPLY_APIVFS_NETNS" or so... dunno.

agreed

sorry for the confusoin around this...

No worries :-)

@evverx
Copy link
Contributor

evverx commented Oct 19, 2016

systemd-nspawn -D ./cont -U -b
...
Selected user namespace base 1605697536 and range 65536.
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to write boot id: Permission denied

Seems like the outer child doesn't patch tmpfs options:

2599  mount("tmpfs", "/home/vagrant/systemd/./cont/run", "tmpfs", MS_NOSUID|MS_NODEV|MS_STRICTATIME, "mode=755") = 0

Should be something like

2890  mount("tmpfs", "/home/vagrant/systemd/./cont/run", "tmpfs", MS_NOSUID|MS_NODEV|MS_STRICTATIME, "mode=755,uid=1605697536,gid=1605697536") = 0

@s-urbaniak
Copy link
Contributor Author

@evverx gah, thanks, already spotted my mistake.

int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
const char *full, *top, *x;
int r;
long extra_flags = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned long

{ NULL, "/proc/sys", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_FATAL|MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* ... then, make it r/o */
{ "/proc/sysrq-trigger", "/proc/sysrq-trigger", NULL, NULL, MS_BIND, MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* Bind mount first ...*/
{ NULL, "/proc/sysrq-trigger", NULL, NULL, MS_BIND|MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_REMOUNT, MOUNT_IN_USERNS|MOUNT_APPLY_APIVFS_RO }, /* ... then, make it r/o */
{ "tmpfs", "/tmp", "tmpfs", "mode=1777", MS_STRICTATIME, MOUNT_FATAL|MOUNT_IN_USERNS },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this table is finally legible.

const char *o;
bool fatal = (mount_table[k].mount_settings & MOUNT_FATAL);

if ((mount_settings ^ mount_table[k].mount_settings) & MOUNT_IN_USERNS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno, I don't find those bitmask manipulations very clear. I think it would be easier to read if this was written as
if ((mount_settings & MOUNT_IN_USERNS) != (mount_table[k].mount_settings & MOUNT_IN_USERNS))


if (!use_netns && mount_table[k].use_netns)
if (~mount_settings & mount_table[k].mount_settings & MOUNT_APPLY_APIVFS_RO)
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper variable would make the intent clear:

   bool leftover_flags = ~mount_settings & mount_table[k].mount_settings;
   if (leftover_flags & MOUNT_APPLY_APIVFS_NETNS ||
       leftover_flags & MOUNT_APPLY_APIVFS_RO)
          continue;

#include "cgroup-util.h"

typedef enum MountSettingsMask {
MOUNT_FATAL = 1 << 0, /* if set, a mount error are considered fatal */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammaro

MOUNT_FATAL = 1 << 0, /* if set, a mount error are considered fatal */
MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
MOUNT_APPLY_APIVFS_RO = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

MOUNT_USE_USERNS = 1 << 1, /* if set, the mounts are patched considering uid/gid shifts in a user namespace */
MOUNT_IN_USERNS = 1 << 2, /* if set, the mount is executed in the inner child, else in the outer child */
MOUNT_APPLY_APIVFS_RO = 1 << 3, /* if set, /proc/sys, and /sysfs will be mounted read-only, else read-write. */
MOUNT_APPLY_APIVFS_NETNS = 1 << 4, /* if set, /proc/sys/net will be mounted read-write. Works only if MOUNT_APPLY_APIVFS_NETNS is also set. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap the comment?

arg_clone_ns_flags = (arg_clone_ns_flags & ~ns_flag) | (r > 0 ? 0 : ns_flag);
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious empty line.

return;
}

r = getenv_bool("SYSTEMD_NSPAWN_API_VFS_WRITABLE");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a minor thing, but querying the envvar twice is ugly. Use

r = parse_boolean(e);

instead.

arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS;
} else {
arg_mount_settings |= MOUNT_APPLY_APIVFS_RO;
arg_mount_settings &= ~MOUNT_APPLY_APIVFS_NETNS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentational that lines 403 and 406 are the same? If yes, it should be moved below the else block.

@keszybz keszybz added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Oct 20, 2016
@s-urbaniak
Copy link
Contributor Author

@keszybz thanks for the review! I'll address the comments in the next round.

@evverx
Copy link
Contributor

evverx commented Oct 21, 2016

@s-urbaniak , I think some combinations doesn't make sense, right?

+ UNIFIED_CGROUP_HIERARCHY=no
+ SYSTEMD_NSPAWN_USE_CGNS=no
+ SYSTEMD_NSPAWN_API_VFS_WRITABLE=network
+ systemd-nspawn --register=no -D /var/lib/machines/unified-no-cgns-no-apivfs-network -U -b
Spawning container unified-no-cgns-no-apivfs-network on /var/lib/machines/unified-no-cgns-no-apivfs-network.
...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC "")...
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): Operation not permitted
Failed to chown() cgroup file cgroup.events, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.controllers, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.subtree_control, ignoring: No such file or directory
Umounting /tmp/unifiedQhn3DL...
Failed to chown() cgroup file tasks, ignoring: No such file or directory
Failed to chown() cgroup file notify_on_release, ignoring: No such file or directory
Failed to chown() cgroup file cgroup.clone_children, ignoring: No such file or directory
Child died too early.

Shouldn't we call the verify_arguments or so?

@s-urbaniak
Copy link
Contributor Author

@evverx hmm, I need to have a more detailed look into this (, but won't make until the beginning of the next week.)

@s-urbaniak
Copy link
Contributor Author

@evverx hmm, I can't reproduce this, I am recompiling my kernel now with user-namespace support, but at least without -U:

sudo UNIFIED_CGROUP_HIERARCHY=no SYSTEMD_NSPAWN_USE_CGNS=no SYSTEMD_NSPAWN_API_VFS_WRITABLE=network ~/src/systemd/systemd-nspawn -D ~/tmp/debian-sid

works for me. I am apparently missing something, but I don't see a connection between write-only /proc/sys/net, and the above settings (non-unified cgroup hierarchy, no cgroup namespace)?!

Can you provide me with a little more details on a setup that is reproducible for you?

Thanks!

PS: I am sur on freenode, also in #rkt, and #rkt-dev.

@s-urbaniak
Copy link
Contributor Author

ha, userns-enabled kernel now let's me reproduce me this ...

@evverx
Copy link
Contributor

evverx commented Nov 2, 2016

@s-urbaniak

but at least without -U ...
works for me

Indeed, this works without -U. I think -U breaks something.

Can you provide me with a little more details on a setup that is reproducible for you?

-bash-4.3# cat /proc/version
Linux version 4.7.9-200.fc24.x86_64 ([email protected]) (gcc version 6.2.1 20160916 (Red Hat 6.2.1-2) (GCC) ) #1 SMP Thu Oct 20 14:26:16 UTC 2016

-bash-4.3# UNIFIED_CGROUP_HIERARCHY=no SYSTEMD_NSPAWN_USE_CGNS=no SYSTEMD_NSPAWN_API_VFS_WRITABLE=network systemd-nspawn --register=no -D ./cont/ -U -b
Spawning container cont on /root/./cont.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1374814208 and range 65536.
Failed to mount n/a on /root/./cont/sys/fs/selinux (MS_BIND ""): No such file or directory
Failed to mount n/a on /root/./cont/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND ""): Invalid argument
Timezone Etc/UTC does not exist in container, not updating container timezone.
Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC ""): Operation not permitted
# inner child
407   clone(child_stack=NULL, flags=CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|SIGCHLD) = 439
...
# outer child
439   mkdir("/proc/sysrq-trigger", 0755) = -1 EEXIST (File exists)
439   stat("/proc/sysrq-trigger", {st_dev=makedev(0, 46), st_ino=4026532058, st_mode=S_IFREG|0200, st_nlink=1, st_uid=65534, st_gid=65534, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2016/11/02-07:0
439   mount(NULL, "/proc/sysrq-trigger", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0
439   lstat("/tmp", 0x7ffca702fb20)     = -1 ENOENT (No such file or directory)
439   mkdir("/tmp", 0755)               = 0
439   mount("tmpfs", "/tmp", "tmpfs", MS_STRICTATIME, "mode=1777,uid=0,gid=0") = 0
439   open("/sys", O_RDONLY)            = 9</sys>
439   fstatfs(9</sys>, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=62592, f_bfree=62592, f_bavail=62592, f_files=62592, f_ffree=62589, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSU
439   close(9</sys>)                    = 0
439   mkdir("/sys/full", 0755)          = 0
439   mount("sysfs", "/sys/full", "sysfs", MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
439   writev(2</0>, [{iov_base="Failed to mount sysfs on /sys/full (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC \"\"): Operation not permitted", iov_len=103}, {iov_base="\n", iov_len=1}], 2) = 104
439   exit_group(1)                     = ?

@poettering
Copy link
Member

(not that it matters, the commit msg talks about "/sysfs" currently, which should be "/sys" of course...)

@evverx
Copy link
Contributor

evverx commented Nov 17, 2016

I'm reading rkt/rkt#3368

Presumably ditch nspawn eventually
...
Transitioning from nspawn to runc for stage1-spawning will help supporting some more cases

So, do we really need this feature at all? Who will use it?
@poettering , @s-urbaniak , @lucab ?

s-urbaniak and others added 2 commits November 18, 2016 09:50
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.

If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.

This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.

This is also a follow-up on the discussion from
systemd#4018 (comment) where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
@lucab
Copy link
Contributor

lucab commented Nov 18, 2016

So, do we really need this feature at all? Who will use it?

Just speaking for rkt (as I don't know if other people have the same needs): nspawn-based stage1 will use it. kvm-based stage1 doesn't need it. An hypothetical runc-based stage1 will not need it. But:

  • we only have nspawn&kvm stage1 at the moment, and no direct plan for a runc-based stage1 right now. See Investigate integrating Runc rkt/rkt#3327 (comment) for a better explanation.
  • the comment you quoted is related to a broader discussion about incompatible pod-level and app-level features in rkt. @euank suggested that using runc instead of nspawn would solve it, and I replied that it would solve something, but the general problem still exists.

TLDR: rkt may consider a runc-based stage1 at some point, which would come with its own set of pros and cons, quite a bit work, and still need initial investigation. nspawn-based stage1 is what we consider stable and we keep working to make it a bit more flexible. This PR goes into that direction, solves a current issue we have, and may (or may not) be useful for anybody else using an externally prepared net-ns.

(sorry for stealing this PR for rkt roadmapping :)

@poettering poettering removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 18, 2016
@poettering poettering added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Nov 18, 2016
@evverx
Copy link
Contributor

evverx commented Nov 18, 2016

@lucab , thanks for the explanation!

may be useful for anybody else using an externally prepared net-ns

Indeed, makes sense.

And my last question :-)
Why not to document SYSTEMD_NSPAWN_API_VFS_WRITABLE?

@lucab
Copy link
Contributor

lucab commented Nov 18, 2016

Why not to document SYSTEMD_NSPAWN_API_VFS_WRITABLE?

This follows what was recommended for the other clone-related flags here:
#2982 (comment).

@evverx
Copy link
Contributor

evverx commented Nov 18, 2016

Well, indeed, --share-system and friends

are pretty crazy, and we should not have them

But SYSTEMD_NSPAWN_API_VFS_WRITABLE doesn't look crazy (to me at least)

@poettering poettering merged commit f4ff4aa into systemd:master Nov 18, 2016
@poettering
Copy link
Member

@evverx well, it's not crazy, but it's also not the kind of stuff I want to push people towards, and make fully supported with guarantees to stay around forever in exactly the current form...

But you are right, we should have some kind of documentation for all the env vars we have collected now. But I am not sure it should be a proper user-facing man page, but more some text file (or markdown file) in the git tree, that lists all the special env vars we added over the time, and explains them briefly, but also makes clear that these aren't really primary use controls, but mostly exist with reduced stability gurantees and primarily for debug purposes, and for specific purposes of external software, but not so much for humans... Both nspawn and systemctl collected quite a few env vars over the times, so I figure a doc would be appropriate indeed.

@evverx
Copy link
Contributor

evverx commented Nov 18, 2016

But I am not sure it should be a proper user-facing man page, but more some text file (or markdown file) in the git tree, that lists all the special env vars we added over the time

git log -p :-)

@s-urbaniak , many thanks!

lucab added a commit to lucab/rkt that referenced this pull request Nov 22, 2016
systemd v232 introduced an hardcoded protection for /proc/sysrq-trigger,
so rkt insecure mode does not work anymore for that path.
Disabling the insecure part of TestPathsWrite for now, can be re-enabled
once systemd/systemd#4395 is available in stage1.
manover pushed a commit to manover/systemd that referenced this pull request Jan 21, 2017
xaiki pushed a commit to endlessm/systemd that referenced this pull request Feb 1, 2017
ddstreet pushed a commit to ddstreet/systemd that referenced this pull request Jul 4, 2019
Imported using git-ubuntu import.

Changelog parent: 018d813

New changelog entries:
  [ Martin Pitt ]
  * debian/tests/unit-config: Query pkg-config for system unit dir.
    This fixes confusion on merged-/usr systems where both /usr/lib/systemd and
    /lib/systemd exist. It's actually useful to verify that systemd.pc says the
    truth.
  * debian/tests/upstream: Fix clobbering of merged-/usr symlinks
  * debian/tests/systemd-fsckd: Create /etc/default/grub.d if necessary
  * debian/rules: Drop check for linking to libs in /usr.
    This was just an approximation, as booting without an initrd could still be
    broken by library updates (e. g. #828991). With merged /usr now being the
    default this is now completely moot.
  * Move kernel-install initrd script to a later prefix.
    60- does not leave much room for scripts that want to run before initrd
    building (which is usually one of the latest things to do), so bump to 85.
    Thanks to Sjoerd Simons for the suggestion.
  * Disable 99-default.link instead of the udev rule for disabling persistent interface names.
    Disabling 80-net-setup-link.rules will also cause ID_NET_DRIVER to not be set
    any more, which breaks 80-container-ve.network and matching on driver name in
    general. So disable the actual default link policy instead.
    Still keep testing for 80-net-setup-link.rules in the upgrade fix and
    73-usb-net-by-mac.rules to keep the desired behaviour on systems which already
    disabled ifnames via that udev rule.
    See https://lists.freedesktop.org/archives/systemd-devel/2016-November/037805.html
  * debian/tests/boot-and-services: Always run seccomp test
    seccomp is now available on all architectures on which Debian and Ubuntu run
    tests, so stop making this test silently skip if seccomp is disabled.
  * Bump libseccomp build dependency as per configure.ac.
  * Replace "Drop RestrictAddressFamilies=" patch with sed call.
    With that it will also apply to upstream builds/CI, and it is structurally
    simpler.
  * Rebuild against libseccomp with fixed shlibs. (Closes: #844497)
  [ Michael Biebl ]
  * fstab-generator: add x-systemd.mount-timeout option. (Closes: #843989)
  * build-sys: do not install ctrl-alt-del.target symlink twice.
    (Closes: #844039)
  * Enable lz4 support.
    While the compression rate is not as good as XZ, it is much faster, so a
    better default for the journal and especially systemd-coredump.
    (Closes: #832010)
  [ Felipe Sateler ]
  * Enable machines.target by default. (Closes: #806787)
  [ Evgeny Vereshchagin ]
  * debian/tests/upstream: Print all journal files.
    We don't print all journal files. This is misleading a bit:
    systemd#4331 (comment)
    systemd#4395 (comment)
  [ Luca Boccassi ]
  * Use mount --move in initramfs-tools udev script.
    Due to recent changes in busybox and initramfs-tools the mount
    utility is no longer the one from busybox but from util-linux.
    The latter does not support mount -o move.
    The former supports both -o move and --move, so use it instead to be
    compatible with both.
    See this discussion for more details:
    https://bugs.debian.org/823856 (Closes: #844775)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed nspawn

Development

Successfully merging this pull request may close these issues.

6 participants