WIP: Make /sys/fs/cgroup rw by agowa · Pull Request #42043 · moby/moby

agowa · 2021-02-17T20:21:47Z

Fixes #42040

- What I did

Makes /sys/fs/cgroup rw. This patch is an only intended as reference as I don't know how to make this change dependent upon the used cgroup version.

It should be rw for cgroup2 and ro for cgroup1 (because of container escapes).

- How I did it

- How to verify it

- Description for the changelog
Allow writing to /sys/fs/cgroup to allow containers to manage there own cgroup spaces and create child spaces. E.g. to run systemd within a container without unnecessary privileges.

thaJeztah · 2021-02-17T20:25:33Z

/cc @AkihiroSuda @justincormack

AkihiroSuda

Before changing the default we should have a flag such as --security-opt cgroupfs-writable=BOOLEAN to toggle the behavior.

Also, even if we are going to change the default, the default should remain RO when cgroupns is set to the host mode.
And yet I'm not sure whether cgroup2 is really guaranteed to be safely writable. The current version might be safe, but future version might not be.

So I'm reluctant to change the default.

agowa · 2021-02-19T03:27:32Z

@AkihiroSuda You're right, a flag may be a better option, but I don't know the moby code base well enough to implement this. Therefore I changed the default and marked this PR as WIP so that it is clear what configuration I intend to have changed.

Also as to my knowledge having the v2 cgroups writeable should be fine, but a security assessment may prove useful here.

AkihiroSuda · 2021-02-19T09:18:39Z

cc @giuseppe @kolyshkin @cyphar FYI

cyphar · 2021-02-19T10:37:23Z

Since the cgroup the process is in is the same cgroup where the restrictions are set, I don't think it's at all safe to allow containers to write to cgroup files -- that would allow a container to modify its own restrictions. If we created a sub-cgroup that the container was placed inside, maybe that would be safe but I'm not convinced to be honest.

As an aside, I wonder if you can change the devices eBPF program today with a read-only mount -- IOW does bpf(2) actually stop you from operating on a read-only mounted cgroup? (And the devices cgroup is the strongest argument for not doing this -- a container being able to modify its own devices restrictions would be very bad.)

agowa · 2021-02-19T10:39:59Z

Today you cannot set eBPF (and afaik not even with this patch) as that is filtered (at least that's the error you get when you apply this patch and try to start systemd.

Ok, I thought docker already creates a sub-cgroup for the containers. That was my fault than. But having cgroup support within containers would still be very valuable.
If this doesn't even work with sub-cgroups, is there a change to get this possibility upstream into the linux kernel so that docker can use it?

cyphar · 2021-02-19T10:46:26Z

Yes, bpf(2) is blocked by the default seccomp profile (unless your container has CAP_SYS_ADMIN) -- the reason I mentioned the devices eBPF is because I hadn't considered whether BPF_PROG_ATTACH actually obeys read-only mounts for attaching (and a quick look at the code tells me it doesn't, but I haven't tested it yet).

Ok, I thought docker already creates a sub-cgroup for the containers. That was my fault than. But having cgroup support within containers would still be very valuable.

I would have to think about how we would implement this, since runc is doing cgroup setup and there isn't a nice way to tell runc to configure a cgroup and then move the program into a sub-cgroup...

If this doesn't even work with sub-cgroups, is there a change to get this possibility upstream into the linux kernel so that docker can use it?

If containers were placed in a sub-cgroup such that the restricted cgroup is a parent, then this should be safe -- Tejun has said that cgroupv2 is safe for delegation, which tells me that the kernel is giving us a guarantee that this is safe.

agowa · 2021-02-19T10:48:13Z

So we depend upon adding a new feature to runc to allow to create sub-cgroup than?

darkyzhou · 2023-02-08T15:06:14Z

Any updates? I see podman "solves" this by introducing a new option: --security-opt unmask=/sys/fs/cgroup

containers/podman#9536

tianon · 2024-11-05T22:49:15Z

In re-reading this thread with fresh eyes, it occurs to me that we're discussing two separate things that I think could be disconnected in order to make some progress?

in order to enable this by default, we need more functionality/UX design in runc (harder)

in order to make it possible for users to opt into this (without the full "hammer" that is --privileged), I think we just need to add a new security opt and check for it in the following block of code?

moby/daemon/oci_linux.go

Lines 667 to 675 in f0cec02

    
           // TODO: until a kernel/mount solution exists for handling remount in a user namespace, 
        
           // we must clear the readonly flag for the cgroups mount (@mrunalp concurs) 
        
           if uidMap := daemon.idMapping.UIDMaps; uidMap != nil || c.HostConfig.Privileged { 
        
           	for i, m := range s.Mounts { 
        
           		if m.Type == "cgroup" { 
        
           			clearReadOnly(&s.Mounts[i]) 
        
           		} 
        
           	} 
        
           }

(the argument for the latter without necessarily having the former is that there are use cases where the downsides listed above are not as bad, such as containers that are already mostly unconfined from a cgroups perspective -- certainly better than going all the way to privileged!)

Fixes moby#42040 Closes moby#42043 Rather than making cgroups read-write by default, instead have a flag for making it possible. Since these security options are passed through the cli to daemon API, no changes are needed to docker-cli. Since this is currently only a single toggle, I also considered making it a `bool` like `cgroups-rw=true`. I could go either way. It being a string makes the intuitive value kindof hidden to users even moreso than the args to --security-opt already are. Signed-off-by: Vincent Batts <[email protected]>

…an option Fixes moby#42040 Closes moby#42043 Rather than making cgroups read-write by default, instead have a flag for making it possible. Since these security options are passed through the cli to daemon API, no changes are needed to docker-cli. Signed-off-by: Vincent Batts <[email protected]>

kolyshkin · 2024-11-15T00:25:52Z

In podman, the relevant CLI option is --security-opt=unmask=/sys/fs/cgroup, described here (look for `unmask).

If at all possible, can the option being added be made compatible with podman?

tianon · 2024-11-16T00:23:48Z

(I've replied to that same question over in #48828 (comment) instead where we're reviewing a explicit proposal adding an option 👀)

/sys/fs/cgroup rw

42a76bf

thaJeztah added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Feb 17, 2021

agowa mentioned this pull request Feb 17, 2021

make /sys/fs/cgroup rw by default with cgroup2 #42040

Closed

AkihiroSuda requested changes Feb 18, 2021

View reviewed changes

AkihiroSuda added area/security area/cgroup2 cgroup v2 kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Feb 18, 2021

peterzhuamazon mentioned this pull request Sep 20, 2023

[BUG] AL2023 with systemd 252 cause docker not able to run systemd on cgroupv2 opensearch-project/opensearch-build#4047

Closed

vbatts mentioned this pull request Nov 6, 2024

daemon/&container/: enable --security-opt writable-cgroups=true|false as an option #48828

Merged

thaJeztah closed this in e3cdd59 Jan 23, 2025

thaJeztah closed this in #48828 Jan 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Make /sys/fs/cgroup rw#42043

WIP: Make /sys/fs/cgroup rw#42043
agowa wants to merge 1 commit intomoby:masterfrom
agowa:sys_fs_cgroup_rw_patch

agowa commented Feb 17, 2021

Uh oh!

thaJeztah commented Feb 17, 2021

Uh oh!

AkihiroSuda left a comment •

edited

Loading

Uh oh!

agowa commented Feb 19, 2021 •

edited

Loading

Uh oh!

AkihiroSuda commented Feb 19, 2021

Uh oh!

cyphar commented Feb 19, 2021 •

edited

Loading

Uh oh!

agowa commented Feb 19, 2021 •

edited

Loading

Uh oh!

cyphar commented Feb 19, 2021

Uh oh!

agowa commented Feb 19, 2021

Uh oh!

darkyzhou commented Feb 8, 2023 •

edited

Loading

Uh oh!

tianon commented Nov 5, 2024

Uh oh!

kolyshkin commented Nov 15, 2024 •

edited

Loading

Uh oh!

tianon commented Nov 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

agowa commented Feb 17, 2021

Uh oh!

thaJeztah commented Feb 17, 2021

Uh oh!

AkihiroSuda left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

agowa commented Feb 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AkihiroSuda commented Feb 19, 2021

Uh oh!

cyphar commented Feb 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

agowa commented Feb 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyphar commented Feb 19, 2021

Uh oh!

agowa commented Feb 19, 2021

Uh oh!

darkyzhou commented Feb 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tianon commented Nov 5, 2024

Uh oh!

kolyshkin commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tianon commented Nov 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

AkihiroSuda left a comment •

edited

Loading

agowa commented Feb 19, 2021 •

edited

Loading

cyphar commented Feb 19, 2021 •

edited

Loading

agowa commented Feb 19, 2021 •

edited

Loading

darkyzhou commented Feb 8, 2023 •

edited

Loading

kolyshkin commented Nov 15, 2024 •

edited

Loading