Cgroup namespace #3589

Merged
poettering merged 2 commits into systemd:master from brauner:cgroup_namespace
Jul 25, 2016

@brauner
Contributor

@brauner brauner commented Jun 23, 2016

This adds support for cgroup namespaces, which are available since Linux 4.6. Cgroup namespaces work with both the legacy and the unified cgroup hierarchy. For legacy:

Inside new cgroup namespace:

sudo unshare --cgroup

conventiont:/home/chb # cat /proc/self/cgroup
11:blkio:/
10:memory:/
9:net_cls,net_prio:/
8:hugetlb:/
7:cpu,cpuacct:/
6:freezer:/
5:cpuset:/
4:pids:/
3:devices:/
2:perf_event:/
1:name=systemd:/

conventiont:/home/chb # ls -al /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 Jun 23 13:52 /proc/self/ns/cgroup -> cgroup:[4026532485]

Parent cgroup namespace:

[chb@conventiont ~]$ cat /proc/self/cgroup
11:blkio:/user.slice
10:memory:/user.slice
9:net_cls,net_prio:/user.slice
8:hugetlb:/
7:cpu,cpuacct:/user.slice
6:freezer:/
5:cpuset:/
4:pids:/user.slice/user-1000.slice/session-1.scope
3:devices:/user.slice
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-1.scope

[chb@conventiont ~]$ ls -al /proc/self/ns/cgroup
lrwxrwxrwx 1 chb users 0 Jun 23 13:52 /proc/self/ns/cgroup -> 'cgroup:[4026531835]'
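
The two listings above differ only in the path column of each /proc/self/cgroup line. As a minimal sketch (not part of this PR), the "hierarchy-ID:controller-list:path" format can be parsed like this to check whether a process sees "/" as its cgroup path; the struct and both function names are hypothetical and the buffer sizes are arbitrary assumptions:

```c
#include <stdlib.h>
#include <string.h>

/* One entry of /proc/<pid>/cgroup: "hierarchy-ID:controller-list:cgroup-path".
 * Hypothetical type for illustration; the field sizes are arbitrary. */
struct cgroup_entry {
        int hierarchy_id;
        char controllers[64];
        char path[256];
};

/* Hypothetical helper: parse a single line.
 * Returns 0 on success, -1 if the line is malformed or a field is too long. */
static int parse_cgroup_line(const char *line, struct cgroup_entry *e) {
        const char *first, *second;
        size_t n;

        first = strchr(line, ':');
        if (!first)
                return -1;
        second = strchr(first + 1, ':');
        if (!second)
                return -1;

        /* atoi() stops at the first ':' */
        e->hierarchy_id = atoi(line);

        /* Controller list sits between the two colons (empty on the unified hierarchy). */
        n = (size_t) (second - first - 1);
        if (n >= sizeof(e->controllers))
                return -1;
        memcpy(e->controllers, first + 1, n);
        e->controllers[n] = '\0';

        /* Path runs to the end of the line, without the trailing newline. */
        n = strcspn(second + 1, "\n");
        if (n >= sizeof(e->path))
                return -1;
        memcpy(e->path, second + 1, n);
        e->path[n] = '\0';

        return 0;
}

/* Returns 1 if the entry's path is exactly "/", as seen inside a fresh cgroup namespace. */
static int cgroup_entry_is_root(const struct cgroup_entry *e) {
        return strcmp(e->path, "/") == 0;
}
```

Fed the "4:pids:..." line from the parent namespace this reports a non-root path; fed "4:pids:/" from inside the new namespace it reports root.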

@brauner
Contributor Author

brauner commented Jun 23, 2016

Just saw: related to #2112.


bool cg_ns_supported(void)
{
return access("/proc/self/ns/cgroup", F_OK) == 0;
Member

nitpick: please follow the usual coding style, and place the opening bracket on the same line as the function name. i.e.:

bool cg_ns_supported(void) {
…

Contributor Author

Sorry, I was misled by the function definition directly above.

@poettering
Member

looks pretty good. mostly minor issues.

(oh, one more thing: we don't use Signed-off-by in systemd, that's a kernel thing)

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Jun 23, 2016
@brauner brauner force-pushed the cgroup_namespace branch from a3d0aae to 3c1f06e Compare June 23, 2016 23:02
@martinpitt martinpitt added the ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR label Jun 24, 2016
@martinpitt
Contributor

This also seems to break nspawn, see the autopkgtest log for the failed "build-and-services" and the first "upstream" nspawn test:

Jun 23 23:59:03 adt systemd[1]: systemd-nspawn@c1.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 23:59:03 adt systemd-nspawn[1482]: Failed to mount /sys/fs/cgroup: Invalid argument
Jun 23 23:59:03 adt systemd[1]: Failed to start Container c1.
Jun 23 23:59:03 adt systemd-nspawn[1482]: Child died too early.

@evverx
Contributor

evverx commented Jun 24, 2016

Failed to mount /sys/fs/cgroup: Invalid argument

Yeah, this is on 4.4

On 4.6:

root# uname -r
4.6.2-1-ARCH

root# env UNIFIED_CGROUP_HIERARCHY=no ../../systemd-nspawn --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.6Rcdc5/nspawn-root /usr/lib/systemd/systemd systemd.unit=multi-user.target
Spawning container nspawn-root on /var/tmp/systemd-test.6Rcdc5/nspawn-root.
Press ^] three times within 1s to kill container.
Child died too early.
Failed to read link /sys/fs/cgroup/cpu: No such file or directory

@brauner
Contributor Author

brauner commented Jun 24, 2016

I'm on this. Sorry.

@evverx
Contributor

evverx commented Jun 24, 2016

@brauner , oh, sorry, I was wrong.
This works on 4.6. (I didn't update the whole systemd, only systemd-nspawn)
I tested master by mistake.

So, yeah,

Failed to read link /sys/fs/cgroup/cpu: No such file or directory

@martinpitt
Contributor

Note that the Ubuntu 4.4 kernels have the cgroup namespace feature backported, as we use it for LXD. If that's somehow incomplete, I can move the testing to a newer kernel (with some additional overhead). However, AFAIR current systemd policy is that things should generally work with kernels ≤ 2 years old, so there should at least be a reasonable fallback.

@evverx
Contributor

evverx commented Jun 24, 2016

Note that the Ubuntu 4.4 kernels have the cgroup namespace feature backported

Oh, indeed:

$ uname -a
Linux ubuntu-yakkety 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ sudo unshare -C
root@ubuntu-yakkety:~# cat /proc/self/cgroup
11:pids:/
10:cpuset:/
9:blkio:/
8:devices:/
7:memory:/
6:perf_event:/
5:hugetlb:/
4:cpu,cpuacct:/
3:freezer:/
2:net_cls,net_prio:/
1:name=systemd:/

Good to know, thanks!

I've checked this patch on Fedora 24:

$ uname -r
4.5.7-300.fc24.x86_64

$ sudo strace unshare -C
...
unshare(CLONE_NEWCGROUP)                = -1 EINVAL (Invalid argument)
...
$ sudo systemd-nspawn -D /nspawn-root -b 3
...works fine...

@brauner
Contributor Author

brauner commented Jun 24, 2016

So I misconstrued how systemd-nspawn handles mounting cgroups at first read. I think what we can simply do is unshare(CLONE_NEWCGROUP) in the inner child after the cgroups have been mounted. The problem then, from an information-leak point of view, is that cat /proc/self/cgroup always shows / while you still have access to the whole root cgroup tree under /sys/fs/cgroup because of how systemd currently does the mounting. In lxc we hide the root cgroup tree as well. This would probably mean a more invasive change here.

@brauner
Contributor Author

brauner commented Jun 24, 2016

Oh, and thanks for the feedback!

@brauner brauner force-pushed the cgroup_namespace branch 2 times, most recently from b4f3457 to dd8e1b4 Compare June 24, 2016 16:25
@brauner
Contributor Author

brauner commented Jun 24, 2016

So here is how I implemented it so far: when cgroup namespaces are enabled, we unshare the cgroup namespace after all limits and so on have been applied, but we do not mount cgroups, since that is unnecessary with cgroup namespaces and only causes an information leak. We should then be correctly placed in the right cgroups when we do cat /proc/self/cgroup and should only see our root cgroup and not our parent cgroup under /sys/fs/cgroup. I have tested this with the legacy cgroup hierarchy and it works fine.
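
For readers following along, here is a minimal standalone sketch of the "probe, then unshare, else fall back" idea discussed in this thread. It is not the code from this PR: cg_ns_probe() mirrors the access() check quoted earlier, while cg_ns_unshare_if_supported() and its return convention are hypothetical. Treating EINVAL as "unsupported" is an assumption based on the Fedora 4.5 strace above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for unshare() and CLONE_NEWCGROUP */
#endif
#include <errno.h>
#include <sched.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000 /* value from the kernel headers, for older libcs */
#endif

/* Same probe as in the patch: the link exists only if the kernel knows cgroup namespaces. */
static int cg_ns_probe(void) {
        return access("/proc/self/ns/cgroup", F_OK) == 0;
}

/* Hypothetical wrapper. Returns:
 *   1  if a new cgroup namespace was entered,
 *   0  if the kernel lacks support (the caller keeps the existing mount logic),
 *  <0  a negative errno for any other failure, e.g. -EPERM without CAP_SYS_ADMIN. */
static int cg_ns_unshare_if_supported(void) {
        if (!cg_ns_probe())
                return 0;

        if (unshare(CLONE_NEWCGROUP) < 0) {
                /* Assumption: EINVAL means the flag is unknown to this kernel. */
                if (errno == EINVAL)
                        return 0;
                return -errno;
        }

        return 1;
}
```

A caller would branch on the result: 1 means /proc/self/cgroup now shows "/", 0 means the pre-namespace code path should be used unchanged.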

return r;
if (cg_ns_supported()) {
r = unshare(CLONE_NEWCGROUP);
if (r < 0)
Contributor

Well, systemd-nspawn doesn't fail on startup. But this breaks UNIFIED_CGROUP_HIERARCHY:

nspawn understands the $UNIFIED_CGROUP_HIERARCHY
environment variable to individually select the hierarchy to
use for executed containers. By default, nspawn will use the
unified hierarchy for the containers if the host uses the
unified hierarchy, and the legacy hierarchy otherwise.

-bash-4.3# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

-bash-4.3# UNIFIED_CGROUP_HIERARCHY=yes systemd-nspawn -D /nspawn-root/ -b 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0

Contributor

@evverx evverx Jun 25, 2016

Also, this behaves strangely with the unified hierarchy:

-bash-4.3# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

-bash-4.3# unshare -C cat /proc/self/cgroup
0::/

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

container# cat /proc/1/cgroup
9:cpuset:/
8:devices:/init.scope
7:cpu,cpuacct:/init.scope
6:net_cls:/
5:freezer:/
1:name=systemd:/init.scope
0::/

Contributor Author

Thanks for testing on unified @evverx. Starting unified with systemd-nspawn didn't work for me with v228 independent of the patch. So maybe I need to test that from master again.

I'm not entirely clear what you're getting at with your second point. In the case of cgroup namespaces the container will be able to mount a cgroup filesystem by itself just as on normal system bootup. So we don't need to bind-mount, I think. If you're getting at some subsystems missing in the container, I think this is explained by how cgroup v1 and v2 interact: as you have mounted cgroup2 on the host, you have likely attached the available subsystems (memory, pids, etc.) to the v2 hierarchy, which means that they are not available in the v1 hierarchy. This is why they do not appear in the container which checks the available controllers in the v1 hierarchy.

@poettering would you prefer a different approach?

Contributor

This is why they do not appear in the container which checks the available controllers in the v1 hierarchy.

But why do we need to check the v1-controllers on the v2-hierarchy?

In the case of cgroup namespaces the container will be able to mount a cgroup filesystem by itself just as on normal system bootup.

Yeah. But we shouldn't mount v1 on v2 (and vice versa)

master:

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3
...
container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/machine-nspawn\134x2droot.scope cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

container# cat /proc/1/cgroup
0::/machine.slice/machine-nspawn\x2droot.scope/init.scope

Contributor

So, something went wrong here:

int mount_cgroup_controllers(char ***join_controllers) {
         _cleanup_set_free_free_ Set *controllers = NULL;
         int r;

         if (!cg_is_legacy_wanted())
                 return 0;

         /* Mount all available cgroup controllers that are built into the kernel. */

         controllers = set_new(&string_hash_ops);
         if (!controllers)
                 return log_oom();

cg_is_legacy_wanted should return 0

Contributor Author

On Sat, Jun 25, 2016 at 12:24:00AM -0700, Evgeny Vereshchagin wrote:

@@ -2594,9 +2594,15 @@ static int inner_child(
                 return -ESRCH;
         }

-        r = mount_systemd_cgroup_writable("", arg_unified_cgroup_hierarchy);
-        if (r < 0)
-                return r;
+        if (cg_ns_supported()) {
+                r = unshare(CLONE_NEWCGROUP);
+                if (r < 0)

Well, systemd-nspawn doesn't fail on startup. But this breaks UNIFIED_CGROUP_HIERARCHY:

This is an indirect consequence of cgroup namespaces. With cgroup namespaces the
container will mount the cgroupfs itself. Hence, mounting the cgroupfs is the
task of systemd inside the container, as opposed to the bind-mount magic used
when cgroup namespaces are not available. If we want systemd inside the
container to mount the unified cgroup hierarchy, the simplest solution is to
pass systemd.unified_cgroup_hierarchy=1 as an argument to systemd-nspawn:

systemd-nspawn -D /some/rootfs -b 'systemd.unified_cgroup_hierarchy=1'

To be backwards compatible with prior systemd-nspawn versions that allow
setting the UNIFIED_CGROUP_HIERARCHY env variable, we can simply append
systemd.unified_cgroup_hierarchy=1 to the arguments passed to the container's
init. However, when the user simply wants a shell inside the container, things
get more complicated since there is no systemd/init process that sets up the
cgroupfs.

Minor point: Note also, that the systemd v230 release notes state that booting
unified cgroups with kernels >= 4.5 requires systemd v230. This is why I
had trouble using unified cgroups:

"WARNING: it is not possible to use previous systemd versions with
systemd.unified_cgroup_hierarchy=1 and the new kernel. Therefore it is
necessary to also update systemd in the initramfs if using the unified
hierarchy. An updated SELinux policy is also required."
(https://lists.freedesktop.org/archives/systemd-devel/2016-May/036583.html)

Since the cgroup namespaces patch here requires that systemd inside the
container mounts the cgroupfs, systemd v230 is required inside the
container with a kernel >= 4.5.


Contributor Author

I can reproduce this behavior with systemd master independent of this patch.
Sorry, I'm a little confused as to what you're getting at here.

Contributor

Sorry, I'm a little confused as to what you're getting at here.

@brauner , sorry.
I mean:

By default, nspawn will use the unified hierarchy for the containers if the host uses the
unified hierarchy, and the legacy hierarchy otherwise.

Your patch doesn't work as expected: #3589 (comment)

container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

master works fine:

container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/machine.slice/machine-nspawn\134x2droot.scope cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

Yeah

systemd-nspawn -D /some/rootfs -b 'systemd.unified_cgroup_hierarchy=1'

mounts the v2-hierarchy. But we should do this by default (i.e. without systemd.unified_cgroup_hierarchy=1)

Contributor Author

Thanks for the clarification, @evverx. Yes, I can think of a way to do this.
When we detect that unified is requested or used on the host, we append
"systemd.unified_cgroup_hierarchy=1" to the arguments passed to the container's
init.



@brauner brauner force-pushed the cgroup_namespace branch from 65d1fce to 06c87a2 Compare June 26, 2016 11:34
// legacy cgroup.
if (arg_unified_cgroup_hierarchy && cg_ns_supported() && arg_start_mode == START_BOOT) {
if (strv_extend(&arg_parameters, "systemd.unified_cgroup_hierarchy=1") < 0)
return log_oom();
Contributor

@evverx evverx Jun 26, 2016

@brauner , thanks!

systemd-nspawn -D /nspawn-root/ -b 3

works fine.
But

-bash-4.3# systemd-nspawn -D /nspawn-root/ /usr/lib/systemd/systemd 3
...
container# grep cgroup /proc/self/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

This is a regression.

Another issue: we overwrite the user's setting

-bash-4.3# systemd-nspawn -D /nspawn-root -b 3 systemd.unified_cgroup_hierarchy=0
...
container# grep cgroup /proc/self/mounts
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

(actually, systemd.unified_cgroup_hierarchy=... never really works. So this is not a regression. Maybe, we should document this)
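
One way to avoid overriding an explicit user choice is to append the default only when no systemd.unified_cgroup_hierarchy= argument is present yet. The following is a minimal standalone sketch of that idea; it uses plain NULL-terminated arrays rather than systemd's strv helpers, and has_unified_param() and build_init_args() are hypothetical names, not functions from this PR:

```c
#include <string.h>

#define UNIFIED_PARAM "systemd.unified_cgroup_hierarchy="

/* Returns 1 if any argument already starts with "systemd.unified_cgroup_hierarchy=". */
static int has_unified_param(char *const *argv) {
        for (; argv && *argv; argv++)
                if (strncmp(*argv, UNIFIED_PARAM, strlen(UNIFIED_PARAM)) == 0)
                        return 1;
        return 0;
}

/* Hypothetical helper: copy argv into out (capacity cap, NULL-terminated) and
 * append the "=1" default only if unified is wanted and the user did not set
 * the parameter. Returns the number of entries, or -1 if out is too small. */
static int build_init_args(char *const *argv, int want_unified, const char **out, size_t cap) {
        char *const *a;
        size_t n = 0;

        if (cap == 0)
                return -1;

        for (a = argv; a && *a; a++) {
                if (n + 1 >= cap) /* keep one slot for the terminating NULL */
                        return -1;
                out[n++] = *a;
        }

        if (want_unified && !has_unified_param(argv)) {
                if (n + 1 >= cap)
                        return -1;
                out[n++] = UNIFIED_PARAM "1";
        }

        out[n] = NULL;
        return (int) n;
}
```

With the arguments "3 systemd.unified_cgroup_hierarchy=0" the list is passed through unchanged; with just "3" the "=1" default is appended.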

@brauner brauner force-pushed the cgroup_namespace branch 3 times, most recently from 922ef9e to d4bb3a8 Compare June 28, 2016 08:47
@brauner
Contributor Author

brauner commented Jun 28, 2016

The clean way to handle cgroup namespaces would be to delegate mounting of
cgroups completely to the init system in the container. However, this would
likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY
environment variable of systemd-nspawn. Also, no cgroupfs would be mounted
whenever the user simply requests a shell and no init is available to mount
cgroups. I've changed the implementation to account for this by "manually"
mounting a cgroupfs even when cgroup namespaces are present.

@brauner brauner force-pushed the cgroup_namespace branch from d4bb3a8 to 2812b69 Compare June 28, 2016 11:15
arg_uid_range,
arg_selinux_apifs_context);
if (r < 0)
return r;
Contributor

Jun 28 13:12:02 adt systemd[1]: Starting Container c1...
Jun 28 13:12:02 adt systemd-nspawn[1485]: Selected user namespace base 84410368 and range 65536.
Jun 28 13:12:02 adt systemd-nspawn[1485]: mount(/var/lib/machines/c1/sys/fs/selinux) failed, ignoring: No such file or directory
Jun 28 13:12:02 adt systemd-nspawn[1485]: mount(/var/lib/machines/c1/sys/fs/selinux) failed, ignoring: Invalid argument
Jun 28 13:12:02 adt systemd-nspawn[1485]: Timezone Etc/UTC does not exist in container, not updating container timezone.
Jun 28 13:12:02 adt systemd-nspawn[1485]: Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory
Jun 28 13:12:02 adt systemd-nspawn[1485]: Child died too early.
Jun 28 13:12:02 adt systemd[1]: systemd-nspawn@c1.service: Main process exited, code=exited, status=1/FAILURE
Jun 28 13:12:02 adt systemd[1]: Failed to start Container c1.
Jun 28 13:12:02 adt systemd[1]: systemd-nspawn@c1.service: Unit entered failed state.
Jun 28 13:12:02 adt systemd[1]: systemd-nspawn@c1.service: Failed with result 'exit-code'.

I think

r = path_is_mount_point(cgroup_root, AT_SYMLINK_FOLLOW);
if (r < 0)
    return log_error_errno(r, "Failed to determine if /sys/fs/cgroup is already mounted: %m");

doesn't work in the inner child (after mount_move_root)
Seems like we should check /sys/fs/cgroup in the outer_child and pass the result of the check to the inner_child.

Contributor Author

No, this is not the real cause. The real cause is that /sys is mounted read-only when --private-veth is used, so we are not allowed to create /sys/fs/cgroup. That mkdir fails before the call you're pointing to.

Contributor

oh, right

[pid  8274] 1467144470.419389 mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0
[...]
[pid  8274] 1467144470.419942 stat("/sys/fs", {st_dev=makedev(0, 42), st_ino=3, st_mode=S_IFDIR|0755, st_nlink=4, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=80, st_atime=2016/06/28-20:07:50.346550852, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid  8274] 1467144470.420194 mkdir("/sys/fs/cgroup", 0755) = -1 EROFS (Read-only file system)
[pid  8274] 1467144470.420240 lstat("/sys", {st_dev=makedev(0, 42), st_ino=2, st_mode=S_IFDIR|0755, st_nlink=9, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=180, st_atime=2016/06/28-20:07:50.342550765, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid  8274] 1467144470.420292 lstat("/sys/fs", {st_dev=makedev(0, 42), st_ino=3, st_mode=S_IFDIR|0755, st_nlink=4, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=80, st_atime=2016/06/28-20:07:50.346550852, st_mtime=2016/06/28-20:07:50.418552403, st_ctime=2016/06/28-20:07:50.418552403}) = 0
[pid  8274] 1467144470.420338 lstat("/sys/fs/cgroup", 0x7ffcdafb3760) = -1 ENOENT (No such file or directory)
[pid  8274] 1467144470.420385 writev(2, [{"Failed to determine if /sys/fs/cgroup is already mounted: No such file or directory", 83}, {"\n", 1}], 2) = 84

sorry
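
The strace shows the mkdir() failure (EROFS) going unreported, so the later lstat() error (ENOENT) is what ends up in the log. As a rough sketch of how the earlier failure could be surfaced instead, assuming a hypothetical helper ensure_directory() that is not part of nspawn:

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: make sure `path` exists as a directory.
 * Returns 0 if it was created or already exists as a directory, otherwise a
 * negative errno, for example -EROFS when the parent is mounted read-only. */
static int ensure_directory(const char *path, mode_t mode) {
        struct stat st;

        if (mkdir(path, mode) == 0)
                return 0;

        /* Anything but "already exists" is a real error the caller should log. */
        if (errno != EEXIST)
                return -errno;

        /* The path exists; verify that it really is a directory. */
        if (stat(path, &st) < 0)
                return -errno;

        return S_ISDIR(st.st_mode) ? 0 : -ENOTDIR;
}
```

Checking this return value before probing the mount point would report "Read-only file system" for /sys/fs/cgroup rather than the less direct "No such file or directory".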

electimon pushed a commit to electimon/android_kernel_samsung_universal7904-common that referenced this pull request Jun 7, 2025
On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2b0 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Evgeny Vereshchagin <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Aditya Kali <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: [email protected] # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: systemd/systemd#3589 (comment)
Signed-off-by: Chatur27 <[email protected]>
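
To illustrate the difference between the two emptiness checks described in the commit message, here is a deliberately simplified model. The structs below are stand-ins invented for this sketch and do not match the kernel's real css_set or cgroup layout; only the logic (test "any css_set linked" versus "any linked css_set has tasks") follows the description:

```c
#include <stddef.h>

/* Simplified stand-in for a css_set linked to a cgroup (not the kernel's layout). */
struct css_set_model {
        unsigned int nr_tasks;      /* tasks currently using this css_set */
        struct css_set_model *next; /* next css_set linked to the same cgroup */
};

/* Simplified stand-in for a cgroup with its list of linked css_sets. */
struct cgroup_model {
        struct css_set_model *cset_links; /* NULL when nothing is linked */
};

/* Old check as described: non-empty as soon as any css_set is linked,
 * which includes a css_set that is only pinned by a cgroup namespace. */
static int cgroup_has_csets(const struct cgroup_model *cgrp) {
        return cgrp->cset_links != NULL;
}

/* New check as described: walk every linked css_set and look for actual tasks. */
static int cgroup_has_tasks(const struct cgroup_model *cgrp) {
        const struct css_set_model *c;

        for (c = cgrp->cset_links; c; c = c->next)
                if (c->nr_tasks > 0)
                        return 1;

        return 0;
}
```

For a namespace root whose only css_set is the pinned, task-less one, the first function reports "not empty" while the second reports "empty", which is the case the fix is about.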
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Jul 18, 2025
On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2b0 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Evgeny Vereshchagin <[email protected]>
Cc: Serge E. Hallyn <[email protected]>
Cc: Aditya Kali <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: [email protected] # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: systemd/systemd#3589 (comment)
Signed-off-by: Chatur27 <[email protected]>
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Jul 21, 2025
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Jul 21, 2025
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Jul 21, 2025
Kneba pushed a commit to Tiktodz/android_kernel_asus_sdm636 that referenced this pull request Aug 2, 2025
Signed-off-by: Tiktodz <[email protected]>
Signed-off-by: Kneba <[email protected]>
Signed-off-by: dotkit <[email protected]>
mihoy3rd pushed a commit to MissElysia/android_kernel_oppo_sdm660 that referenced this pull request Aug 9, 2025
mihoy3rd pushed a commit to MissElysia/android_kernel_oppo_sdm660 that referenced this pull request Aug 9, 2025
mihoy3rd pushed a commit to MissElysia/android_kernel_oppo_sdm660 that referenced this pull request Aug 10, 2025
JoysKo pushed a commit to JoysKo/HYBRID_CAF_kernel that referenced this pull request Aug 20, 2025
Signed-off-by: Kunmun <[email protected]>
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Aug 24, 2025
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Aug 24, 2025
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Aug 24, 2025
micr0softstore pushed a commit to micr0softstore/android_linux_kernel_samsung_SM-T515 that referenced this pull request Aug 26, 2025
micr0softstore pushed a commit to micr0softstore/android_linux_kernel_samsung_SM-T515 that referenced this pull request Aug 27, 2025
eun0115 pushed a commit to micr0softstore/android_linux_kernel_samsung_SM-T515 that referenced this pull request Aug 27, 2025
electimon pushed a commit to electimon/android_kernel_samsung_universal7904-common that referenced this pull request Aug 31, 2025
electimon pushed a commit to electimon/android_kernel_samsung_universal7904-common that referenced this pull request Aug 31, 2025
micr0softstore pushed a commit to micr0softstore/samsung_android_kernel_T510 that referenced this pull request Sep 13, 2025
eun0115 pushed a commit to eun0115/android_kernel_samsung_tab7904 that referenced this pull request Sep 27, 2025
furbanoramos21-testing pushed a commit to furbanoramos21-testing/android_kernel_samsung_exynos7870 that referenced this pull request Oct 8, 2025
…p namespace

Change-Id: I7e2210ad1a3e605fa10ad1f723214b3adb2dfb5e
(cherry picked from commit 9157056)
Signed-off-by: nostalgiceagle <[email protected]>
JoysKo pushed a commit to JoysKo/HYBRID_CAF_kernel that referenced this pull request Oct 10, 2025
Signed-off-by: Oktapra Amtono <[email protected]>
furbanoramos21-testing pushed a commit to furbanoramos21-testing/android_kernel_samsung_exynos7870 that referenced this pull request Nov 13, 2025
nostalgiceagle pushed a commit to nostalgiceagle/kremol_donut_univ7904 that referenced this pull request Nov 14, 2025
Signed-off-by: Kunmun <[email protected]>
Sliva4 pushed a commit to Sliva4/kernel_a30s that referenced this pull request Jan 11, 2026
Signed-off-by: Kunmun <[email protected]>
tberd pushed a commit to tberd/android_kernel_lge_bullhead that referenced this pull request Feb 16, 2026
Signed-off-by: Chatur27 <[email protected]>
tberd pushed a commit to tberd/android_kernel_lge_bullhead that referenced this pull request Feb 17, 2026
Signed-off-by: Chatur27 <[email protected]>
tberd pushed a commit to tberd/android_kernel_lge_bullhead that referenced this pull request Feb 19, 2026
Signed-off-by: Chatur27 <[email protected]>
tberd pushed a commit to tberd/android_kernel_lge_bullhead that referenced this pull request Feb 20, 2026
Signed-off-by: Chatur27 <[email protected]>
tberd pushed a commit to tberd/android_kernel_lge_bullhead that referenced this pull request Feb 22, 2026
Signed-off-by: Chatur27 <[email protected]>