Cgroup namespace by brauner · Pull Request #3589 · systemd/systemd

brauner · 2016-06-23T11:55:01Z

This adds support for cgroup namespaces, which are available since Linux 4.6. Cgroup namespaces work with both the legacy and the unified cgroup hierarchy. For legacy:

Inside new cgroup namespace:

sudo unshare --cgroup

conventiont:/home/chb # cat /proc/self/cgroup
11:blkio:/
10:memory:/
9:net_cls,net_prio:/
8:hugetlb:/
7:cpu,cpuacct:/
6:freezer:/
5:cpuset:/
4:pids:/
3:devices:/
2:perf_event:/
1:name=systemd:/

conventiont:/home/chb # ls -al /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 Jun 23 13:52 /proc/self/ns/cgroup -> cgroup:[4026532485]

Parent cgroup namespace:

[chb@conventiont ~]$ cat /proc/self/cgroup
11:blkio:/user.slice
10:memory:/user.slice
9:net_cls,net_prio:/user.slice
8:hugetlb:/
7:cpu,cpuacct:/user.slice
6:freezer:/
5:cpuset:/
4:pids:/user.slice/user-1000.slice/session-1.scope
3:devices:/user.slice
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-1.scope

[chb@conventiont ~]$ ls -al /proc/self/ns/cgroup
lrwxrwxrwx 1 chb users 0 Jun 23 13:52 /proc/self/ns/cgroup -> 'cgroup:[4026531835]'
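The two listings above differ only in the path field of each `/proc/self/cgroup` line. As a minimal sketch of how that field could be inspected programmatically, the following parses one `<hierarchy-id>:<controllers>:<path>` line. The struct and both function names are hypothetical helpers for illustration, not part of this patch. Note that a path of `/` alone does not prove a new namespace, since some controllers (hugetlb, freezer) already show `/` in the parent listing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One parsed line of /proc/self/cgroup: "<hierarchy-id>:<controllers>:<path>".
 * Hypothetical type for illustration only. */
struct cgroup_entry {
        int hierarchy_id;
        char controllers[64];
        char path[256];
};

/* Hypothetical helper: returns 0 on success, -1 on malformed or oversized input. */
int parse_cgroup_line(const char *line, struct cgroup_entry *e) {
        const char *first, *second;
        char *end;
        long id;
        size_t n;

        assert(line);
        assert(e);

        first = strchr(line, ':');
        if (!first)
                return -1;
        second = strchr(first + 1, ':');
        if (!second)
                return -1;

        /* The hierarchy ID must be a number that ends exactly at the first colon. */
        id = strtol(line, &end, 10);
        if (end == line || end != first)
                return -1;

        /* The controller list may be empty, as in the unified "0::/" form. */
        n = (size_t) (second - (first + 1));
        if (n >= sizeof(e->controllers))
                return -1;
        memcpy(e->controllers, first + 1, n);
        e->controllers[n] = '\0';

        /* The path runs to the end of the line, excluding a trailing newline. */
        n = strcspn(second + 1, "\n");
        if (n == 0 || n >= sizeof(e->path))
                return -1;
        memcpy(e->path, second + 1, n);
        e->path[n] = '\0';

        e->hierarchy_id = (int) id;
        return 0;
}

/* True if the entry's path is the root of the visible cgroup tree. */
int cgroup_entry_is_root(const struct cgroup_entry *e) {
        assert(e);
        return strcmp(e->path, "/") == 0;
}
```

Feeding every line of `/proc/self/cgroup` through `parse_cgroup_line()` and checking `cgroup_entry_is_root()` gives a quick view of which hierarchies appear rooted at `/` from the caller's point of view.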

brauner · 2016-06-23T11:57:13Z

Just saw: related to #2112.

poettering · 2016-06-23T16:49:13Z

src/basic/cgroup-util.c


+bool cg_ns_supported(void)
+{
+	return access("/proc/self/ns/cgroup", F_OK) == 0;


nitpick: please follow the usual coding style, and place the opening bracket on the same line as the function name. i.e.:

bool cg_ns_supported(void) { …

Sorry, was irritated by the function definition directly above.
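For reference, a self-contained sketch of the probe with the opening bracket on the same line as the function name. `cg_ns_supported()` mirrors the function in the diff above; `ns_file_exists()` is a hypothetical helper added only so the check can be exercised against arbitrary paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: true if the given path exists at all. */
bool ns_file_exists(const char *path) {
        assert(path);
        return access(path, F_OK) == 0;
}

/* The kernel exposes /proc/self/ns/cgroup only when cgroup namespaces are
 * compiled in, so the presence of the file serves as the feature probe. */
bool cg_ns_supported(void) {
        return ns_file_exists("/proc/self/ns/cgroup");
}
```

The probe only says that the namespace file exists; it does not guarantee that a later `unshare(CLONE_NEWCGROUP)` will succeed for the calling process.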

poettering · 2016-06-23T16:56:42Z

looks pretty good. mostly minor issues.

(oh, one more thing: we don't use Signed-off-by in systemd, that's a kernel thing)

martinpitt · 2016-06-24T05:58:16Z

This also seems to break nspawn, see the autopkgtest log for the failed "build-and-services" and the first "upstream" nspawn test:

Jun 23 23:59:03 adt systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 23 23:59:03 adt systemd-nspawn[1482]: Failed to mount /sys/fs/cgroup: Invalid argument
Jun 23 23:59:03 adt systemd[1]: Failed to start Container c1.
Jun 23 23:59:03 adt systemd-nspawn[1482]: Child died too early.

evverx · 2016-06-24T06:38:36Z

Failed to mount /sys/fs/cgroup: Invalid argument

Yeah, this is on 4.4

On 4.6:

root# uname -r
4.6.2-1-ARCH

root# env UNIFIED_CGROUP_HIERARCHY=no ../../systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.6Rcdc5/nspawn-root /usr/lib/systemd/systemd systemd.unit=multi-user.target
Spawning container nspawn-root on /var/tmp/systemd-test.6Rcdc5/nspawn-root.
Press ^] three times within 1s to kill container.
Child died too early.
Failed to read link /sys/fs/cgroup/cpu: No such file or directory

brauner · 2016-06-24T06:48:03Z

I'm on this. Sorry.

evverx · 2016-06-24T06:58:28Z

~~@brauner , oh, sorry, I was wrong.~~
~~This works on 4.6. (I didn't update the whole systemd, only systemd-nspawn)~~
I tested master by mistake.

So, yeah,

Failed to read link /sys/fs/cgroup/cpu: No such file or directory

martinpitt · 2016-06-24T09:05:22Z

Note that the Ubuntu 4.4 kernels have the cgroup namespace feature backported, as we use it for LXD. If that's somehow incomplete, I can move the testing to a newer kernel (with some additional overhead). However, AFAIR current systemd policy is that things should generally work with kernels ≤ 2 years old, so things should at least have a reasonable fallback.

evverx · 2016-06-24T09:33:14Z

Note that the Ubuntu 4.4 kernels have the cgroup namespace feature backported

Oh, indeed:

$ uname -a
Linux ubuntu-yakkety 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ sudo unshare -C
root@ubuntu-yakkety:~# cat /proc/self/cgroup
11:pids:/
10:cpuset:/
9:blkio:/
8:devices:/
7:memory:/
6:perf_event:/
5:hugetlb:/
4:cpu,cpuacct:/
3:freezer:/
2:net_cls,net_prio:/
1:name=systemd:/

Good to know, thanks!

I've checked this patch on Fedora 24:

$ uname -r
4.5.7-300.fc24.x86_64

$ sudo strace unshare -C
...
unshare(CLONE_NEWCGROUP)                = -1 EINVAL (Invalid argument)
...
$ sudo systemd-nspawn -D /nspawn-root -b 3
...works fine...
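The strace output above shows a kernel without cgroup namespaces rejecting `unshare(CLONE_NEWCGROUP)` with `EINVAL`. One way to get the "reasonable fallback" mentioned earlier is to treat that specific error as "feature unavailable" rather than as a failure. This is a minimal sketch under that assumption; both function names are hypothetical and this is not what the patch does (the patch probes `/proc/self/ns/cgroup` instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000 /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: map the result of unshare() to
 *   1       a new cgroup namespace was entered
 *   0       the kernel does not know the flag (EINVAL), caller should fall back
 *   -errno  any other failure, e.g. -EPERM without CAP_SYS_ADMIN */
int classify_unshare_result(int ret, int err) {
        if (ret >= 0)
                return 1;
        if (err == EINVAL)
                return 0;
        assert(err > 0);
        return -err;
}

/* Hypothetical wrapper: attempt to unshare the cgroup namespace and classify the outcome. */
int try_unshare_cgroup_ns(void) {
        int ret = unshare(CLONE_NEWCGROUP);

        return classify_unshare_result(ret, errno);
}
```

A caller would continue with the existing bind-mount setup when the result is 0 and abort only on a negative return value.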

brauner · 2016-06-24T10:36:01Z

So I misconstrued how systemd-nspawn handles mounting cgroups at first read. I think what we can simply do is call unshare(CLONE_NEWCGROUP) in the inner child after the cgroups have been mounted. The problem then, from an information-leak point of view, is that cat /proc/self/cgroup always shows / while you still have access to the whole root cgroup tree under /sys/fs/cgroup because of how systemd currently does the mounting. In lxc we hide the root cgroup tree as well. This would probably mean a more invasive change here.

brauner · 2016-06-24T10:36:13Z

Oh, and thanks for the feedback!

brauner · 2016-06-24T18:33:31Z

So here is how I implemented it so far: when cgroup namespaces are enabled, we unshare the cgroup namespace after all limits and so on have been applied, but we do not mount cgroups, since that is unnecessary with cgroup namespaces and only causes an information leak. We should then be correctly placed in the right cgroups when we do cat /proc/self/cgroup and should only see our root cgroup and not our parent cgroup under /sys/fs/cgroup. I have tested this with the legacy cgroup hierarchy and it works fine.

evverx · 2016-06-25T07:23:54Z

src/nspawn/nspawn.c

-                return r;
+	if (cg_ns_supported()) {
+		r = unshare(CLONE_NEWCGROUP);
+		if (r < 0)


Well, systemd-nspawn doesn't fail on startup. But this breaks UNIFIED_CGROUP_HIERARCHY:

nspawn understands the $UNIFIED_CGROUP_HIERARCHY
environment variable to individually select the hierarchy to
use for executed containers. By default, nspawn will use the
unified hierarchy for the containers if the host uses the
unified hierarchy, and the legacy hierarchy otherwise.

-bash-4.3# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

-bash-4.3# UNIFIED_CGROUP_HIERARCHY=yes systemd-nspawn -D /nspawn-root/ -b 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0

Also, this behaves strangely with the unified hierarchy:

-bash-4.3# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

-bash-4.3# unshare -C cat /proc/self/cgroup
0::/

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

container# cat /proc/1/cgroup
9:cpuset:/
8:devices:/init.scope
7:cpu,cpuacct:/init.scope
6:net_cls:/
5:freezer:/
1:name=systemd:/init.scope
0::/

Thanks for testing on unified @evverx. Starting unified with systemd-nspawn didn't work for me with v228 independent of the patch. So maybe I need to test that from master again.

I'm not entirely clear what you're getting at with your second point. With cgroup namespaces the container will be able to mount a cgroup filesystem by itself, just as on normal system bootup, so I think we don't need to bind-mount. If you mean that some subsystems are missing in the container, I think this is explained by how cgroup v1 and v2 interact: as you have mounted cgroup2 on the host, you have likely attached the available subsystems (memory, pids, etc.) to the v2 hierarchy, which means that they are not available in the v1 hierarchy. This is why they do not appear in the container, which checks the available controllers in the v1 hierarchy.

@poettering would you prefer a different approach?

This is why they do not appear in the container which checks the available controllers in the v1 hierarchy.

But why do we need to check the v1-controllers on the v2-hierarchy?

In the case of cgroup namespaces the container will be able to mount a cgroup filesystem by itself just as on normal system bootup.

Yeah. But we shouldn't mount v1 on v2 (and vice versa)

master:

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3
...
container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/machine-nspawn\134x2droot.scope cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

container# cat /proc/1/cgroup
0::/machine.slice/machine-nspawn\x2droot.scope/init.scope

So, something went wrong here:

int mount_cgroup_controllers(char ***join_controllers) {
        _cleanup_set_free_free_ Set *controllers = NULL;
        int r;

        if (!cg_is_legacy_wanted())
                return 0;

        /* Mount all available cgroup controllers that are built into the kernel. */
        controllers = set_new(&string_hash_ops);
        if (!controllers)
                return log_oom();

cg_is_legacy_wanted should return 0

On Sat, Jun 25, 2016 at 12:24:00AM -0700, Evgeny Vereshchagin wrote:

@@ -2594,9 +2594,15 @@ static int inner_child(
                return -ESRCH;
        }

        r = mount_systemd_cgroup_writable("", arg_unified_cgroup_hierarchy);
        if (r < 0)
                return r;
        if (cg_ns_supported()) {
                r = unshare(CLONE_NEWCGROUP);
                if (r < 0)

Well, systemd-nspawn doesn't fail on startup. But this breaks UNIFIED_CGROUP_HIERARCHY:

This is an indirect consequence of cgroup namespaces. With cgroup namespaces the
container will mount the cgroupfs itself. Hence, mounting the cgroupfs is the
task of systemd inside the container as opposed to bind-mount magic when
cgroup namespaces are not available. If we want systemd inside the container to
mount the unified cgroup hierarchy the simplest solution is to pass
systemd.unified_cgroup_hierarchy=1 as argument to systemd-nspawn:

systemd-nspawn -D /some/rootfs -b 'systemd.unified_cgroup_hierarchy=1'

To be backwards compatible with prior systemd-nspawn versions that allow
setting the UNIFIED_CGROUP_HIERARCHY env variable we can simply append
systemd.unified_cgroup_hierarchy=1. However, when the user simply wants a
shell inside the container things get more complicated since there is no
systemd/init process that sets up the cgroupfs.

Minor point: Note also, that the systemd v230 release notes state that booting
unified cgroups with kernels >= 4.5 requires systemd v230. This is why I
had trouble using unified cgroups:

"WARNING: it is not possible to use previous systemd versions with systemd.unified_cgroup_hierarchy=1 and the new kernel. Therefore it is necessary to also update systemd in the initramfs if using the unified hierarchy. An updated SELinux policy is also required." (https://lists.freedesktop.org/archives/systemd-devel/2016-May/036583.html)

Since the cgroup namespaces patch here requires that systemd inside the
container mounts the cgroup it means that systemd v230 is required inside the
container with a kernel >=4.5.

On Sat, Jun 25, 2016 at 03:08:20AM -0700, Evgeny Vereshchagin wrote:

master:

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3
...
container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/machine-nspawn\134x2droot.scope cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

container# cat /proc/1/cgroup
0::/machine.slice/machine-nspawn\x2droot.scope/init.scope

I can reproduce this behavior with systemd master independent of this patch.
Sorry, I'm a little confused as to what you're getting at here.

Sorry, I'm a little confused as to what you're getting at here.

@brauner , sorry.
I mean:

By default, nspawn will use the unified hierarchy for the containers if the host uses the
unified hierarchy, and the legacy hierarchy otherwise.

Your patch doesn't work as expected: #3589 (comment)

container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

master works fine:

container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/machine-nspawn\134x2droot.scope cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

Yeah

systemd-nspawn -D /some/rootfs -b 'systemd.unified_cgroup_hierarchy=1'

mounts the v2-hierarchy. But we should do this by default (i.e. without systemd.unified_cgroup_hierarchy=1)
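The documented default quoted above (the environment variable wins if set, otherwise follow the host) can be sketched as a small decision function. This is an illustration only, not nspawn's actual code: both function names are hypothetical, the caller is assumed to have already determined whether the host uses the unified hierarchy, and the accepted boolean spellings are an assumption for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: returns 1 for a "true" word, 0 for a "false" word,
 * -1 if the string is not recognised. The word lists are an assumption. */
int parse_bool_word(const char *s) {
        static const char *const yes[] = { "1", "yes", "y", "true", "t", "on" };
        static const char *const no[] = { "0", "no", "n", "false", "f", "off" };
        size_t i;

        assert(s);

        for (i = 0; i < sizeof(yes) / sizeof(yes[0]); i++)
                if (strcasecmp(s, yes[i]) == 0)
                        return 1;
        for (i = 0; i < sizeof(no) / sizeof(no[0]); i++)
                if (strcasecmp(s, no[i]) == 0)
                        return 0;
        return -1;
}

/* Hypothetical decision function.
 * env_value:    value of $UNIFIED_CGROUP_HIERARCHY, or NULL if unset
 * host_unified: whether the host itself runs on the unified hierarchy
 * Returns 1 (unified), 0 (legacy) or -1 (unparsable environment value). */
int pick_unified_hierarchy(const char *env_value, bool host_unified) {
        if (!env_value)
                return host_unified ? 1 : 0;
        return parse_bool_word(env_value);
}
```

With this shape, the result for an unset variable on a unified host is 1, which is the "by default" behaviour requested above, independent of how the container's init is started.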

Thanks for the clarification, @evverx. Yes, I can think of a way to do this.
When we detect that unified is requested or used on the host, we append
"systemd.unified_cgroup_hierarchy=1" to the arguments passed to the container's
init.

evverx · 2016-06-26T13:12:30Z

src/nspawn/nspawn.c

+	// legacy cgroup.
+	if (arg_unified_cgroup_hierarchy && cg_ns_supported() && arg_start_mode == START_BOOT) {
+		if (strv_extend(&arg_parameters, "systemd.unified_cgroup_hierarchy=1") < 0)
+			return log_oom();


@brauner , thanks!

systemd-nspawn -D /nspawn-root/ -b 3

works fine.
But

-bash-4.3# systemd-nspawn -D /nspawn-root/ /usr/lib/systemd/systemd 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

This is a regression.

Another issue: we overwrite the user's setting

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3 systemd.unified_cgroup_hierarchy=0
...
container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

(actually, systemd.unified_cgroup_hierarchy=... never really works. So this is not a regression. Maybe, we should document this)
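One way to avoid overwriting the user's setting would be to append the default only when no `systemd.unified_cgroup_hierarchy=` argument is already present. The sketch below uses a plain NULL-terminated array with a fixed capacity instead of systemd's strv helpers, and both function names are hypothetical. Whether the container's init then honours the user's value is a separate question, as noted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UNIFIED_KEY "systemd.unified_cgroup_hierarchy="

/* Hypothetical helper: true if any argument starts with key_prefix. */
bool args_have_key(const char *const *args, const char *key_prefix) {
        size_t n;

        assert(args);
        assert(key_prefix);

        n = strlen(key_prefix);
        for (; *args; args++)
                if (strncmp(*args, key_prefix, n) == 0)
                        return true;
        return false;
}

/* Hypothetical helper: append the default to a NULL-terminated array that has
 * room for `cap` pointers in total (terminator included).
 * Returns 1 if appended, 0 if the user already set the key, -1 if out of room. */
int append_unified_default(const char **args, size_t cap, bool unified) {
        size_t len = 0;

        assert(args);

        if (args_have_key(args, UNIFIED_KEY))
                return 0;

        while (args[len])
                len++;
        if (len + 2 > cap)
                return -1;

        args[len] = unified ? UNIFIED_KEY "1" : UNIFIED_KEY "0";
        args[len + 1] = NULL;
        return 1;
}
```

Called with the argument list `3 systemd.unified_cgroup_hierarchy=0` from the example above, the function returns 0 and leaves the list untouched.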

brauner · 2016-06-28T08:51:27Z

The clean way to handle cgroup namespaces would be to delegate mounting of
cgroups completely to the init system in the container. However, this would
likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of
systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply
requests a shell and no init is available to mount cgroups. I've changed the implementation to account for this by "manually" mounting a cgroupfs even when cgroup namespaces are present.

evverx · 2016-06-28T18:55:53Z

src/nspawn/nspawn.c

+				arg_uid_range,
+				arg_selinux_apifs_context);
+		if (r < 0)
+			return r;


Jun 28 13:12:02 adt systemd[1]: Starting Container c1...
Jun 28 13:12:02 adt systemd-nspawn[1485]: Selected user namespace base 84410368 and range 65536.
Jun 28 13:12:02 adt systemd-nspawn[1485]: mount(/var/lib/machines/c1/sys/fs/selinux) failed, ignoring: No such file or directory
Jun 28 13:12:02 adt systemd-nspawn[1485]: mount(/var/lib/machines/c1/sys/fs/selinux) failed, ignoring: Invalid argument
Jun 28 13:12:02 adt systemd-nspawn[1485]: Timezone Etc/UTC does not exist in container, not updating container timezone.
Jun 28 13:12:02 adt systemd-nspawn[1485]: Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory
Jun 28 13:12:02 adt systemd-nspawn[1485]: Child died too early.
Jun 28 13:12:02 adt systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Jun 28 13:12:02 adt systemd[1]: Failed to start Container c1.
Jun 28 13:12:02 adt systemd[1]: [email protected]: Unit entered failed state.
Jun 28 13:12:02 adt systemd[1]: [email protected]: Failed with result 'exit-code'.

I think

r = path_is_mount_point(cgroup_root, AT_SYMLINK_FOLLOW);
if (r < 0)
        return log_error_errno(r, "Failed to determine if /sys/fs/cgroup is already mounted: %m");

doesn't work in the inner child (after mount_move_root)
Seems like we should check /sys/fs/cgroup in the outer_child and pass the result of the check to the inner_child.

No, this is not the real cause. The real cause is that sys is mounted read-only when --private-veth is used. So we are not allowed to create /sys/fs/cgroup which fails prior to the call you're pointing to.

oh, right

[pid 8274] 1467144470.419389 mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0
[...]
[pid 8274] 1467144470.419942 stat("/sys/fs", {st_dev=makedev(0, 42), st_ino=3, st_mode=S_IFDIR|0755, st_nlink=4, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=80, st_atime=2016/06/28-20:07:50.346550852, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid 8274] 1467144470.420194 mkdir("/sys/fs/cgroup", 0755) = -1 EROFS (Read-only file system)
[pid 8274] 1467144470.420240 lstat("/sys", {st_dev=makedev(0, 42), st_ino=2, st_mode=S_IFDIR|0755, st_nlink=9, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=180, st_atime=2016/06/28-20:07:50.342550765, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid 8274] 1467144470.420292 lstat("/sys/fs", {st_dev=makedev(0, 42), st_ino=3, st_mode=S_IFDIR|0755, st_nlink=4, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=80, st_atime=2016/06/28-20:07:50.346550852, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid 8274] 1467144470.420338 lstat("/sys/fs/cgroup", 0x7ffcdafb3760) = -1 ENOENT (No such file or directory)
[pid 8274] 1467144470.420385 writev(2, [{"Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory", 83}, {"\n", 1}], 2) = 84

sorry
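The trace shows the `mkdir()` failure (`EROFS`) being lost, with the later mount-point check reporting a misleading `ENOENT`. A minimal sketch of a directory-creation step that surfaces the original error instead; `ensure_directory()` is a hypothetical helper, not a systemd function.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: make sure `path` exists as a directory.
 * Returns 0 if the directory exists afterwards, or a negative errno value,
 * e.g. -EROFS when the parent file system is mounted read-only. */
int ensure_directory(const char *path, mode_t mode) {
        struct stat st;

        assert(path);

        if (mkdir(path, mode) >= 0)
                return 0;
        if (errno != EEXIST)
                return -errno; /* report the real cause right here */

        /* Something already exists at the path: accept it only if it is a directory. */
        if (stat(path, &st) < 0)
                return -errno;
        return S_ISDIR(st.st_mode) ? 0 : -ENOTDIR;
}
```

Checking the return value before the mount-point test would have produced "Read-only file system" for `/sys/fs/cgroup` rather than the later "No such file or directory".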

On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Evgeny Vereshchagin <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Aditya Kali <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: [email protected] # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: systemd/systemd#3589 (comment)
Signed-off-by: Chatur27 <[email protected]>
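The difference between the two emptiness checks described in this commit message can be illustrated with a toy model. The struct and function names below are invented for the example and deliberately do not mirror the kernel's data structures: each node stands for a css_set linked to a cgroup, and a set with zero tasks stands for the one pinned by a cgroup namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a css_set linked to a cgroup (not the kernel's type). */
struct toy_css_set {
        int nr_tasks;             /* tasks currently using this set */
        struct toy_css_set *next; /* next set linked to the same cgroup */
};

/* Old-style check: any linked set counts as "cgroup has processes". */
bool populated_by_list(const struct toy_css_set *head) {
        return head != NULL;
}

/* New-style check: walk the sets and look for one that actually has tasks. */
bool populated_by_walk(const struct toy_css_set *head) {
        for (; head; head = head->next) {
                assert(head->nr_tasks >= 0);
                if (head->nr_tasks > 0)
                        return true;
        }
        return false;
}
```

For a cgroup whose only linked set is the task-less one pinned by a namespace, the list check reports it as populated while the walk reports it as empty, which is the case the fix targets.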

On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Evgeny Vereshchagin <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Aditya Kali <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: [email protected] # v4.6+ Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: systemd/systemd#3589 (comment) Signed-off-by: Chatur27 <[email protected]> Signed-off-by: Tiktodz <[email protected]> Signed-off-by: Kneba <[email protected]> Signed-off-by: dotkit <[email protected]>

On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Evgeny Vereshchagin <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Aditya Kali <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: [email protected] # v4.6+ Fixes: a79a908 ("cgroup: introduce cgroup namespaces") Link: systemd/systemd#3589 (comment) Signed-off-by: Chatur27 <[email protected]>

On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Evgeny Vereshchagin <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Aditya Kali <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: [email protected] # v4.6+ Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: systemd/systemd#3589 (comment) Signed-off-by: Kunmun <[email protected]>

On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Evgeny Vereshchagin <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Aditya Kali <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: [email protected] # v4.6+ Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: systemd/systemd#3589 (comment) Signed-off-by: Chatur27 <[email protected]>

poettering added nspawn cgroups labels Jun 23, 2016

poettering reviewed Jun 23, 2016

poettering added the reviewed/needs-rework 🔨 label (PR has been reviewed and needs another round of reworks) Jun 23, 2016

brauner force-pushed the cgroup_namespace branch from a3d0aae to 3c1f06e on June 23, 2016 23:02

martinpitt added the ci-fails/needs-rework 🔥 label (Please rework this, the CI noticed an issue with the PR) Jun 24, 2016

brauner force-pushed the cgroup_namespace branch 2 times, most recently from b4f3457 to dd8e1b4 on June 24, 2016 16:25

evverx reviewed Jun 25, 2016

brauner force-pushed the cgroup_namespace branch from 65d1fce to 06c87a2 on June 26, 2016 11:34

evverx reviewed Jun 26, 2016

brauner force-pushed the cgroup_namespace branch 3 times, most recently from 922ef9e to d4bb3a8 on June 28, 2016 08:47

brauner force-pushed the cgroup_namespace branch from d4bb3a8 to 2812b69 on June 28, 2016 11:15

evverx reviewed Jun 28, 2016

9 participants
