WIP: Make /sys/fs/cgroup rw #42043
Before changing the default we should have a flag such as --security-opt cgroupfs-writable=BOOLEAN to toggle the behavior.
Also, even if we are going to change the default, the default should remain RO when cgroupns is set to the host mode.
And yet I'm not sure whether cgroup2 is really guaranteed to be safely writable. The current version might be safe, but future versions might not be.
So I'm reluctant to change the default.
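To make the proposal above concrete, here is a minimal sketch of how such a toggle could be resolved. It is not moby code: the helper `cgroupMountWritable`, its parameters, and the rule that an explicit option overrides the default are all assumptions made for illustration. Only the option name `cgroupfs-writable` comes from the comment above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cgroupMountWritable is a hypothetical helper (not part of moby). It decides
// whether /sys/fs/cgroup would be mounted read-write for a container.
//
// Assumed rules, following the review comment:
//   - an explicit "cgroupfs-writable=BOOLEAN" security option wins;
//   - otherwise defaultRW applies, except that the default stays read-only
//     when the container shares the host cgroup namespace.
func cgroupMountWritable(securityOpts []string, defaultRW, hostCgroupNS bool) (bool, error) {
	for _, opt := range securityOpts {
		key, val, ok := strings.Cut(opt, "=")
		if !ok || key != "cgroupfs-writable" {
			continue // unrelated security option
		}
		b, err := strconv.ParseBool(val)
		if err != nil {
			return false, fmt.Errorf("invalid value %q for cgroupfs-writable: %w", val, err)
		}
		return b, nil
	}
	if hostCgroupNS {
		return false, nil // default remains RO in host cgroupns mode
	}
	return defaultRW, nil
}

func main() {
	rw, err := cgroupMountWritable([]string{"cgroupfs-writable=true"}, false, false)
	fmt.Println(rw, err)
}
```

Whether an explicit `true` should be honoured under the host cgroup namespace is a policy question the thread leaves open; the sketch simply lets the explicit value win.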
|
@AkihiroSuda You're right, a flag may be a better option, but I don't know the moby code base well enough to implement it. I therefore changed the default and marked this PR as WIP to make clear which configuration I intend to change. To my knowledge, making the v2 cgroups writable should be fine, but a security assessment may prove useful here. |
|
cc @giuseppe @kolyshkin @cyphar FYI |
|
Since the cgroup the process is in is the same cgroup where the restrictions are set, I don't think it's at all safe to allow containers to write to cgroup files -- that would allow a container to modify its own restrictions. If we created a sub-cgroup that the container was placed inside, maybe that would be safe but I'm not convinced to be honest. As an aside, I wonder if you can change the devices eBPF program today with a read-only mount -- IOW does |
|
Today you cannot set eBPF (and afaik not even with this patch), as that is filtered (at least that's the error you get when you apply this patch and try to start systemd). OK, I thought docker already created a sub-cgroup for containers. That was my mistake, then. But having cgroup support within containers would still be very valuable. |
|
Yes,
I would have to think about how we would implement this, since runc is doing cgroup setup and there isn't a nice way to tell runc to configure a cgroup and then move the program into a sub-cgroup...
If containers were placed in a sub-cgroup such that the restricted cgroup is a parent, then this should be safe -- Tejun has said that cgroupv2 is safe for delegation, which tells me that the kernel is giving us a guarantee that this is safe. |
|
So we depend on adding a new feature to runc that allows creating a sub-cgroup, then? |
|
Any updates? I see podman "solves" this by introducing a new option: |
|
In re-reading this thread with fresh eyes, it occurs to me that we're discussing two separate things that I think could be disconnected in order to make some progress?
(the argument for the latter without necessarily having the former is that there are use cases where the downsides listed above are not as bad, such as containers that are already mostly unconfined from a cgroups perspective -- certainly better than going all the way to privileged!) |
Fixes moby#42040 Closes moby#42043 Rather than making cgroups read-write by default, instead have a flag for making it possible. Since these security options are passed through the cli to daemon API, no changes are needed to docker-cli. Since this is currently only a single toggle, I also considered making it a `bool` like `cgroups-rw=true`. I could go either way. It being a string makes the intuitive value kind of hidden to users, even more so than the args to --security-opt already are. Signed-off-by: Vincent Batts <[email protected]>
…an option Fixes moby#42040 Closes moby#42043 Rather than making cgroups read-write by default, instead have a flag for making it possible. Since these security options are passed through the cli to daemon API, no changes are needed to docker-cli. Signed-off-by: Vincent Batts <[email protected]>
|
In podman, the relevant CLI option is If at all possible, can the option being added be made compatible with podman? |
|
(I've replied to that same question over in #48828 (comment) instead where we're reviewing an explicit proposal adding an option 👀) |
Fixes #42040
- What I did
Makes /sys/fs/cgroup rw. This patch is only intended as a reference, as I don't know how to make this change dependent on the cgroup version in use.
It should be rw for cgroup2 and ro for cgroup1 (because of container escapes).
- How I did it
- How to verify it
- Description for the changelog
Allow writing to /sys/fs/cgroup so that containers can manage their own cgroup spaces and create child spaces, e.g. to run systemd within a container without unnecessary privileges.