
nspawn: change the synced cgroup hierarchy permissions too #4223

Merged
keszybz merged 1 commit into systemd:master from
evverx:fix-synced-cgroup-hierarchy-perms
Oct 13, 2016
Conversation

@evverx (Contributor) commented Sep 27, 2016:

Fixes: #4181

@evverx force-pushed the fix-synced-cgroup-hierarchy-perms branch from b496d8d to 29e890c on September 27, 2016
@evverx added the reviewed/needs-rework label on Oct 1, 2016
@evverx force-pushed the fix-synced-cgroup-hierarchy-perms branch from 29e890c to 7281ab6 on October 1, 2016
@evverx added the needs-reporter-feedback label on Oct 1, 2016
```diff
-               is_fs_type(s, RAMFS_MAGIC);
+               is_fs_type(s, RAMFS_MAGIC) ||
+               is_fs_type(s, CGROUP_SUPER_MAGIC) ||
+               is_fs_type(s, CGROUP2_SUPER_MAGIC);
```
@evverx (Contributor, Author):

rm_rf(path, REMOVE_ROOT|REMOVE_ONLY_DIRECTORIES) doesn't work if path is on a cgroup[2] filesystem. I think we should fix rm_rf instead. But let's first check that this PR really fixes #4181.

@poettering (Member):

I figure it would make sense to add a cg_is_cgroupfs() call to cgroup-util.h, and then call that from rm-rf.c and allow it there as well.
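Such a helper might look like the following sketch. This is illustrative only: cg_is_cgroupfs() is the name suggested above, not necessarily the final systemd API, and the magic constants are inlined here so the snippet is self-contained.

```c
#include <stdbool.h>
#include <sys/statfs.h>

/* Filesystem magic numbers (from linux/magic.h), defined here so the
 * sketch compiles standalone. */
#ifndef CGROUP_SUPER_MAGIC
#define CGROUP_SUPER_MAGIC  0x27e0ebUL
#endif
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270UL
#endif

/* True if the statfs result describes a cgroup v1 or v2 mount.
 * rm_rf() could call this to allow removing (empty) directories that
 * live on cgroupfs rather than on a "physical" filesystem. */
static bool cg_is_cgroupfs(const struct statfs *sfs) {
        return sfs->f_type == (__typeof__(sfs->f_type)) CGROUP_SUPER_MAGIC ||
               sfs->f_type == (__typeof__(sfs->f_type)) CGROUP2_SUPER_MAGIC;
}
```

rm-rf.c could then treat a positive cg_is_cgroupfs() result the same way it treats tmpfs/ramfs: not a physical filesystem, so recursion with REMOVE_ONLY_DIRECTORIES is allowed.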

@evverx removed the needs-reporter-feedback and reviewed/needs-rework labels on Oct 1, 2016
```c
        undo_mount = true;

        fn = strjoina(tree, cgroup);
        (void) rm_rf(fn, REMOVE_ROOT|REMOVE_ONLY_DIRECTORIES);
```
Member:

Why is this needed?

@evverx (Contributor, Author):

This fixes #4181 (comment)

How to reproduce:

```shell
# create the new container in some-dir
systemd-nspawn -D some-dir -b
...
^]]]

systemd-nspawn -D some-dir -b -U
...
```

Harder way to reproduce:

```shell
# create the new container in some-dir
systemd-nspawn -D some-dir -b
...
login:
...

# at another terminal
ps -C 'systemd-nspawn'
# find the systemd-nspawn-pid
kill -9 <systemd-nspawn-pid>
systemd-nspawn -D some-dir -b -U
```

@keszybz (Member) commented Oct 9, 2016:

I cannot reproduce this. With the first recipe, things work just fine (without the rm_rf part). At least with the host running unified. I'll reboot and test with the host running mixed later on.

With the second recipe, the container survives the death of nspawn. But in this case we have a mess, so it's probably better if things fail.

@evverx (Contributor, Author) commented Oct 9, 2016:

Hm, this is strange. Did you run ^]]]? I.e., did you get:

Container master terminated by signal KILL.

What is the output of

```shell
mkdir cgroup
mount -t cgroup cgroup -o none,name=systemd,xattr cgroup
ls -lR cgroup/
```

after the first run of systemd-nspawn -D ... -b (without -U)?

@evverx (Contributor, Author):

With the second recipe

Sorry, I was wrong. Actually, I meant kill -9 <pid1-of-the-container>.

@poettering (Member):

@evverx you are saying that if nspawn dies abruptly, the cgroup hierarchy created below its unit isn't cleaned up? That's something PID 1 really should take care of, and we should fix it there.

@poettering (Member):

Oh, I see now: in this case PID 1 on the host of course won't take care of the other hierarchy.

@keszybz (Member) commented Oct 9, 2016:

I'm getting a bunch of failures when running F26 inside systemd-nspawn -U:

  1. systemd-tmpfiles-setup.service:
    Failed to adjust quota for subvolume "/var/lib/machines": Operation not permitted
  2. dev-hugepages.mount, sys-fs-fuse-connections.mount
    mount fails with Permission denied.
  3. systemd-resolved.service:
```
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 13
setsockopt(13, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(13, SOL_IP, IP_PKTINFO, [1], 4) = 0
setsockopt(13, SOL_IP, IP_RECVTTL, [1], 4) = 0
setsockopt(13, SOL_SOCKET, SO_BINDTODEVICE, "lo\0", 3) = -1 EPERM (Operation not permitted)
```

It looks like we need to tighten our Condition checks for the first two. The third looks "interesting".

Also on shutdown:

[FAILED] Failed unmounting /etc/resolv.conf.
[FAILED] Failed unmounting /run/systemd/nspawn/incoming.
[FAILED] Failed unmounting Temporary Directory.

@evverx (Contributor, Author) commented Oct 9, 2016:

@keszybz , right. These are user namespace problems :( I guess we're turning on -U by default too early. For example, see #4078.

systemd-resolved.service:

Did you run systemd-nspawn --private-network -U ... or so? Or simply systemd-nspawn -U?

@evverx (Contributor, Author) commented Oct 9, 2016:

dev-hugepages.mount, sys-fs-fuse-connections.mount
mount fails with Permission denied

There is a bug report: #4280 (comment)

@keszybz (Member) commented Oct 9, 2016:

@evverx good links, thanks. I'm getting lost in all the open bugs, but you seem to have it under control ;)

If you have time, please ack #4319 which has some small fixes on the side.

@keszybz (Member) commented Oct 9, 2016:

@evverx I would merge your original patch, but I'm not convinced about the rm_rf part — it seems like the fix should be elsewhere. But even if it is the correct thing to do, I think it should be a separate commit with a good commit message; currently that change is hidden in that large diff. So can you resubmit the first version which only does the chown?

Wrt. #4280, it doesn't seem to improve things for me.

@keszybz keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 9, 2016
@evverx (Contributor, Author) commented Oct 9, 2016:

But even if it is the correct thing to do, I think it should be a separate commit with a good commit message

Makes sense

So can you resubmit the first version which only does the chown?

But I wonder: how does this patch work for you?

```diff
diff --git a/src/nspawn/nspawn-cgroup.c b/src/nspawn/nspawn-cgroup.c
index 8036323..c22393b 100644
--- a/src/nspawn/nspawn-cgroup.c
+++ b/src/nspawn/nspawn-cgroup.c
@@ -111,9 +111,6 @@ int sync_cgroup(pid_t pid, CGroupUnified unified_requested, uid_t arg_uid_shift)
 
         undo_mount = true;
 
-        fn = strjoina(tree, cgroup);
-        (void) rm_rf(fn, REMOVE_ROOT|REMOVE_ONLY_DIRECTORIES);
-
```

Environment:

```shell
root# cat /proc/version
Linux version 4.7.5-200.fc24.x86_64 ([email protected]) (gcc version 6.2.1 20160916 (Red Hat 6.2.1-2) (GCC) ) #1 SMP Mon Sep 26 21:25:47 UTC 2016

root# grep cgroup /proc/self/mountinfo
23 17 0:20 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:7 - cgroup2 cgroup rw
```

Let's run our container without -U

```shell
root# UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/master/ -b systemd.unit=multi-user.target
...
[  OK  ] Started Update UTMP about System Runlevel Changes.

systemd-testsuite login:
^]]]
Container master terminated by signal KILL.
```

Let's check the v1-hierarchy

```shell
root# mkdir cgroup
root# mount -t cgroup -o none,name=systemd,xattr cgroup cgroup
root# ls -lR cgroup
...
cgroup/machine.slice/machine-master.scope/system.slice/systemd-user-sessions.service:
total 0
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.procs
-rw-r--r-- 1 root root 0 Oct  8 22:58 notify_on_release
-rw-r--r-- 1 root root 0 Oct  8 22:58 tasks

cgroup/machine.slice/machine-master.scope/system.slice/tmp.mount:
total 0
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.procs
-rw-r--r-- 1 root root 0 Oct  8 22:58 notify_on_release
-rw-r--r-- 1 root root 0 Oct  8 22:58 tasks

cgroup/machine.slice/machine-master.scope/user.slice:
total 0
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.clone_children
-rw-r--r-- 1 root root 0 Oct  8 22:58 cgroup.procs
-rw-r--r-- 1 root root 0 Oct  8 22:58 notify_on_release
-rw-r--r-- 1 root root 0 Oct  8 22:58 tasks
```

Let's run our container with -U

```shell
root# UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/master/ -U -b systemd.unit=multi-user.target
...
Detected virtualization systemd-nspawn.
Detected architecture x86-64.

Welcome to Fedora 24 (Cloud Edition)!

Set hostname to <systemd-testsuite>.
Using cgroup controller name=systemd. File system hierarchy is at /sys/fs/cgroup/systemd.
Failed to install release agent, ignoring: No such file or directory
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object, freezing.
Freezing execution.
```

@evverx (Contributor, Author) commented Oct 9, 2016:

Wrt. to #4280, it doesn't seem to improve things for me.

Well, #4280 improves one particular case: systemd, systemd-journald, apache and some helpers inside the userns container: https://asciinema.org/a/7gex3vdcumd1n72i3knstra2b

@keszybz (Member) commented Oct 10, 2016:

Hm, with your latest recipe and -U, I get:

```shell
$ sudo UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=.libs .libs/systemd-nspawn -D /var/lib/machines/fedora-rawhide -U -b systemd.unit=multi-user.target

Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1073283072 and range 65536.
Failed to mount to /sys/fs/cgroup/systemd: No such file or directory
```

I guess that's because some mkdir fails on the way. But that's not related to rm_rf, right?

@keszybz (Member) commented Oct 10, 2016:

Also note #4323.

@evverx (Contributor, Author) commented Oct 10, 2016:

Failed to mount to /sys/fs/cgroup/systemd: No such file or directory

Hm. What is the output of

```shell
grep cgroup /proc/self/mountinfo
cat /proc/cgroups
cat /proc/1/cgroup
```

on the host?

Also, does unshare -C work?

@evverx (Contributor, Author) commented Oct 10, 2016:

strace -f -yy -s 500 -v will help too.

@poettering (Member) left a review comment:

OK, looks all good to me. I'd prefer if we moved the cgroupfs check currently in is_temporary_fs into a new function is_cgroupfs() or so, and placed that in cgroup-util. rm_rf() would then also call that, or so...

Patch looks great otherwise!


@poettering (Member):
Also needs rebase.

@poettering (Member):
Thanks for working on this!

@evverx (Contributor, Author) commented Oct 11, 2016:

I'd prefer if we'd move the cgroupfs check currently in is_temporary_fs into a new function is_cgroupfs() or so, and place that in cgroup-util. rm_rf() would then also call that, or so

Also needs rebase.

Will do. Thanks!

Failed to mount to /sys/fs/cgroup/systemd: No such file or directory

@keszybz , I can't reproduce this; I'm stuck. Am I correct that the setup is:

  • host: rawhide/unified-hierarchy
  • container: rawhide/legacy-hierarchy
  • systemd-nspawn: my PR branch

?

@keszybz (Member) commented Oct 11, 2016:

So... I now added #4351, which adds more debug statements to make it easier to see what is happening.

The issue I'm seeing is this:
I'm running full unified on the host, but requesting legacy in the container:

```shell
$ sudo SYSTEMD_LOG_LEVEL=debug UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=build/.libs build/.libs/systemd-nspawn -M rawhide -b -U
...
Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_STRICTATIME "mode=777,uid=0,gid=0")...
```

Now mountinfo contains

```
132 187 0:41 / /sys/fs/cgroup rw,nosuid,nodev,noexec - tmpfs tmpfs rw,seclabel,mode=777,uid=1148125184,gid=1148125184
```

But the process does not have write permissions for /sys/fs/cgroup:

```
mkdir_p(/sys/fs/cgroup/systemd): Value too large for defined data type
Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NODEV|MS_NOEXEC "none,name=systemd,xattr")...
Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NODEV|MS_NOEXEC "none,name=systemd,xattr"): No such file or directory
```

I added an assert to kill the process when the mount fails. The backtrace is:

```
(gdb) bt
#0  0x00007f91bbb18692 in abort () from /lib64/libc.so.6
#1  0x00007f91bcbf0a19 in log_assert_failed (text=0x55bb116059b2 "r == 0", file=0x55bb11605501 "../src/nspawn/nspawn-mount.c", line=706, func=0x55bb11605ed0 <__PRETTY_FUNCTION__.10072> "mount_legacy_cgroup_hierarchy") at ../src/basic/log.c:795
#2  0x000055bb115f113a in mount_legacy_cgroup_hierarchy (dest=0x55bb11605500 "", controller=0x55bb116058e4 "name=systemd", hierarchy=0x55bb11605659 "systemd", unified_requested=CGROUP_UNIFIED_NONE, read_only=false) at ../src/nspawn/nspawn-mount.c:706
#3  0x000055bb115f174a in mount_legacy_cgns_supported (unified_requested=CGROUP_UNIFIED_NONE, userns=true, uid_shift=1148125184, uid_range=65536, selinux_apifs_context=0x0) at ../src/nspawn/nspawn-mount.c:809
#4  0x000055bb115f262c in mount_cgroups (dest=0x55bb116033e3 "", unified_requested=CGROUP_UNIFIED_NONE, userns=true, uid_shift=1148125184, uid_range=65536, selinux_apifs_context=0x0, use_cgns=true) at ../src/nspawn/nspawn-mount.c:954
#5  0x000055bb115e4ad8 in inner_child (barrier=0x7ffcfe2bb810, directory=0x55bb129cd080 "/var/lib/machines/rawhide", secondary=false, kmsg_socket=12, rtnl_socket=14, fds=0x0) at ../src/nspawn/nspawn.c:2712
#6  0x000055bb115e6b69 in outer_child (barrier=0x7ffcfe2bb810, directory=0x55bb129cd080 "/var/lib/machines/rawhide", console=0x55bb129cd0b0 "/dev/pts/3", root_device=0x0, root_device_rw=true, home_device=0x0, home_device_rw=true, srv_device=0x0, srv_device_rw=true, esp_device=0x0, interactive=true, secondary=false, pid_socket=-1, uuid_socket=-1, notify_socket=-1, kmsg_socket=12, rtnl_socket=14, uid_shift_socket=-1, fds=0x0) at ../src/nspawn/nspawn.c:3163
#7  0x000055bb115e93e8 in run (master=-1, console=0x55bb129cd0b0 "/dev/pts/3", root_device=0x0, root_device_rw=true, home_device=0x0, home_device_rw=true, srv_device=0x0, srv_device_rw=true, esp_device=0x0, interactive=true, secondary=false, fds=0x0, veth_name=0x7ffcfe2bbb10 "", veth_created=0x7ffcfe2bb975, exposed=0x7ffcfe2bbaf0, pid=0x7ffcfe2bb988, ret=0x7ffcfe2bb984) at ../src/nspawn/nspawn.c:3707
#8  0x000055bb115ebb38 in main (argc=5, argv=0x7ffcfe2bbc48) at ../src/nspawn/nspawn.c:4250
```

I guess that's a corner case that we can leave for later. It's not directly related (I think) to this PR.

@keszybz (Member) commented Oct 11, 2016:

I forgot to add: the result is the same with this PR and without.

@evverx (Contributor, Author) commented Oct 12, 2016:

@keszybz , thanks!

I've reproduced it. This is a result of torvalds/linux@036d523#diff-003a2c1efe8cf5587b2796a966c15b1c

I now added #4351

I'll take a look tomorrow (on Thursday). Thanks!

@evverx force-pushed the fix-synced-cgroup-hierarchy-perms branch from 7281ab6 to b37ae90 on October 12, 2016
@evverx (Contributor, Author) commented Oct 12, 2016:

@poettering , @keszybz , updated.

```c
#include "string-util.h"

static bool is_physical_fs(const struct statfs *sfs) {
        return !is_temporary_fs(sfs) && !is_cgroup_fs(sfs);
}
```
@evverx (Contributor, Author):

BTW: I think we should add more filesystems here.
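For illustration, here is a self-contained sketch of how such an extended check could look. The extra magic constants (procfs, sysfs) and the is_virtual_fs name are hypothetical additions, not the merged systemd code:

```c
#include <stdbool.h>
#include <sys/statfs.h>

/* Magic numbers from linux/magic.h, inlined so the sketch compiles
 * standalone (trailing underscore avoids clashing with system headers). */
#define TMPFS_MAGIC_         0x01021994UL
#define RAMFS_MAGIC_         0x858458f6UL
#define CGROUP_SUPER_MAGIC_  0x27e0ebUL
#define CGROUP2_SUPER_MAGIC_ 0x63677270UL
#define PROC_SUPER_MAGIC_    0x9fa0UL
#define SYSFS_MAGIC_         0x62656572UL

/* A broader "not a physical filesystem" test: tmpfs/ramfs (what
 * is_temporary_fs covers), cgroupfs v1/v2 (what is_cgroup_fs covers),
 * plus procfs and sysfs as candidates for "more filesystems". */
static bool is_virtual_fs(const struct statfs *sfs) {
        switch ((unsigned long) sfs->f_type) {
        case TMPFS_MAGIC_:
        case RAMFS_MAGIC_:
        case CGROUP_SUPER_MAGIC_:
        case CGROUP2_SUPER_MAGIC_:
        case PROC_SUPER_MAGIC_:
        case SYSFS_MAGIC_:
                return true;
        default:
                return false;
        }
}

static bool is_physical_fs(const struct statfs *sfs) {
        return !is_virtual_fs(sfs);
}
```

The point of keeping rm_rf restricted to non-physical filesystems is that recursive removal with REMOVE_ONLY_DIRECTORIES is safe on purely kernel-backed trees, while refusing by default on anything that might be real on-disk data.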

@keszybz merged commit f0bef27 into systemd:master on Oct 13, 2016
@evverx deleted the fix-synced-cgroup-hierarchy-perms branch on October 13, 2016
evverx added a commit to evverx/systemd that referenced this pull request Oct 15, 2016
Basically, this test runs:
```
    systemd-nspawn --register=no -D "$_root" -b
    systemd-nspawn --register=no -D "$_root" --private-network -b
    systemd-nspawn --register=no -D "$_root" -U -b
    systemd-nspawn --register=no -D "$_root" --private-network -U -b
```
and exports `UNIFIED_CGROUP_HIERARCHY=[yes|no]` and `SYSTEMD_NSPAWN_USE_CGNS=[yes|no]`

Inspired by
* systemd#3589 (comment)
* systemd#4372 (comment)
* systemd#4223 (comment)
* systemd#1555

and so on :-)
evverx added a commit to evverx/systemd that referenced this pull request Oct 18, 2016
@keszybz removed the reviewed/needs-rework label on Oct 24, 2016
keszybz pushed a commit that referenced this pull request Oct 24, 2016
torvalds/linux@036d523

> vfs: Don't create inodes with a uid or gid unknown to the vfs
>
> It is expected that filesystems can not represent uids and gids from
> outside of their user namespace.  Keep things simple by not even
> trying to create filesystem nodes with non-sense uids and gids.

So, we actually should call `reset_uid_gid` early, to prevent #4223 (comment):

```shell
$ sudo UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=.libs .libs/systemd-nspawn -D /var/lib/machines/fedora-rawhide -U -b systemd.unit=multi-user.target

Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1073283072 and range 65536.
Failed to mount to /sys/fs/cgroup/systemd: No such file or directory
```

Details: #4223 (comment)
Fixes: #4352