Skip to content

WIP: Make /sys/fs/cgroup rw#42043

Closed
agowa wants to merge 1 commit intomoby:masterfrom
agowa:sys_fs_cgroup_rw_patch
Closed

WIP: Make /sys/fs/cgroup rw#42043
agowa wants to merge 1 commit intomoby:masterfrom
agowa:sys_fs_cgroup_rw_patch

Conversation

@agowa
Copy link
Copy Markdown

@agowa agowa commented Feb 17, 2021

Fixes #42040

- What I did

Makes /sys/fs/cgroup rw. This patch is an only intended as reference as I don't know how to make this change dependent upon the used cgroup version.

It should be rw for cgroup2 and ro for cgroup1 (because of container escapes).

- How I did it

- How to verify it

- Description for the changelog
Allow writing to /sys/fs/cgroup to allow containers to manage there own cgroup spaces and create child spaces. E.g. to run systemd within a container without unnecessary privileges.

@thaJeztah
Copy link
Copy Markdown
Member

/cc @AkihiroSuda @justincormack

@thaJeztah thaJeztah added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Feb 17, 2021
Copy link
Copy Markdown
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before changing the default we should have a flag such as --security-opt cgroupfs-writable=BOOLEAN to toggle the behavior.

Also, even if we are going to change the default, the default should remain RO when cgroupns is set to the host mode.
And yet I'm not sure whether cgroup2 is really guaranteed to be safely writable. The current version might be safe, but future version might not be.

So I'm reluctant to change the default.

@AkihiroSuda AkihiroSuda added area/security area/cgroup2 cgroup v2 kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Feb 18, 2021
@agowa
Copy link
Copy Markdown
Author

agowa commented Feb 19, 2021

@AkihiroSuda You're right, a flag may be a better option, but I don't know the moby code base well enough to implement this. Therefore I changed the default and marked this PR as WIP so that it is clear what configuration I intend to have changed.

Also as to my knowledge having the v2 cgroups writeable should be fine, but a security assessment may prove useful here.

@AkihiroSuda
Copy link
Copy Markdown
Member

cc @giuseppe @kolyshkin @cyphar FYI

@cyphar
Copy link
Copy Markdown
Contributor

cyphar commented Feb 19, 2021

Since the cgroup the process is in is the same cgroup where the restrictions are set, I don't think it's at all safe to allow containers to write to cgroup files -- that would allow a container to modify its own restrictions. If we created a sub-cgroup that the container was placed inside, maybe that would be safe but I'm not convinced to be honest.

As an aside, I wonder if you can change the devices eBPF program today with a read-only mount -- IOW does bpf(2) actually stop you from operating on a read-only mounted cgroup? (And the devices cgroup is the strongest argument for not doing this -- a container being able to modify its own devices restrictions would be very bad.)

@agowa
Copy link
Copy Markdown
Author

agowa commented Feb 19, 2021

Today you cannot set eBPF (and afaik not even with this patch) as that is filtered (at least that's the error you get when you apply this patch and try to start systemd.

Ok, I thought docker already creates a sub-cgroup for the containers. That was my fault than. But having cgroup support within containers would still be very valuable.
If this doesn't even work with sub-cgroups, is there a change to get this possibility upstream into the linux kernel so that docker can use it?

@cyphar
Copy link
Copy Markdown
Contributor

cyphar commented Feb 19, 2021

Yes, bpf(2) is blocked by the default seccomp profile (unless your container has CAP_SYS_ADMIN) -- the reason I mentioned the devices eBPF is because I hadn't considered whether BPF_PROG_ATTACH actually obeys read-only mounts for attaching (and a quick look at the code tells me it doesn't, but I haven't tested it yet).

Ok, I thought docker already creates a sub-cgroup for the containers. That was my fault than. But having cgroup support within containers would still be very valuable.

I would have to think about how we would implement this, since runc is doing cgroup setup and there isn't a nice way to tell runc to configure a cgroup and then move the program into a sub-cgroup...

If this doesn't even work with sub-cgroups, is there a change to get this possibility upstream into the linux kernel so that docker can use it?

If containers were placed in a sub-cgroup such that the restricted cgroup is a parent, then this should be safe -- Tejun has said that cgroupv2 is safe for delegation, which tells me that the kernel is giving us a guarantee that this is safe.

@agowa
Copy link
Copy Markdown
Author

agowa commented Feb 19, 2021

So we depend upon adding a new feature to runc to allow to create sub-cgroup than?

@darkyzhou
Copy link
Copy Markdown

darkyzhou commented Feb 8, 2023

Any updates? I see podman "solves" this by introducing a new option: --security-opt unmask=/sys/fs/cgroup

containers/podman#9536

@tianon
Copy link
Copy Markdown
Member

tianon commented Nov 5, 2024

In re-reading this thread with fresh eyes, it occurs to me that we're discussing two separate things that I think could be disconnected in order to make some progress?

  1. in order to enable this by default, we need more functionality/UX design in runc (harder)

  2. in order to make it possible for users to opt into this (without the full "hammer" that is --privileged), I think we just need to add a new security opt and check for it in the following block of code?

    moby/daemon/oci_linux.go

    Lines 667 to 675 in f0cec02

    // TODO: until a kernel/mount solution exists for handling remount in a user namespace,
    // we must clear the readonly flag for the cgroups mount (@mrunalp concurs)
    if uidMap := daemon.idMapping.UIDMaps; uidMap != nil || c.HostConfig.Privileged {
    for i, m := range s.Mounts {
    if m.Type == "cgroup" {
    clearReadOnly(&s.Mounts[i])
    }
    }
    }

(the argument for the latter without necessarily having the former is that there are use cases where the downsides listed above are not as bad, such as containers that are already mostly unconfined from a cgroups perspective -- certainly better than going all the way to privileged!)

vbatts added a commit to vbatts/moby that referenced this pull request Nov 6, 2024
Fixes moby#42040
Closes moby#42043

Rather than making cgroups read-write by default, instead have a flag
for making it possible.

Since these security options are passed through the cli to daemon API,
no changes are needed to docker-cli.

Since this is currently only a single toggle, I also considered making
it a `bool` like `cgroups-rw=true`. I could go either way. It being a
string makes the intuitive value kindof hidden to users even moreso than
the args to --security-opt already are.

Signed-off-by: Vincent Batts <[email protected]>
vbatts added a commit to vbatts/moby that referenced this pull request Nov 6, 2024
Fixes moby#42040
Closes moby#42043

Rather than making cgroups read-write by default, instead have a flag
for making it possible.

Since these security options are passed through the cli to daemon API,
no changes are needed to docker-cli.

Since this is currently only a single toggle, I also considered making
it a `bool` like `cgroups-rw=true`. I could go either way. It being a
string makes the intuitive value kindof hidden to users even moreso than
the args to --security-opt already are.

Signed-off-by: Vincent Batts <[email protected]>
vbatts added a commit to vbatts/moby that referenced this pull request Nov 8, 2024
…an option

Fixes moby#42040
Closes moby#42043

Rather than making cgroups read-write by default, instead have a flag
for making it possible.

Since these security options are passed through the cli to daemon API,
no changes are needed to docker-cli.

Signed-off-by: Vincent Batts <[email protected]>
vbatts added a commit to vbatts/moby that referenced this pull request Nov 12, 2024
…an option

Fixes moby#42040
Closes moby#42043

Rather than making cgroups read-write by default, instead have a flag
for making it possible.

Since these security options are passed through the cli to daemon API,
no changes are needed to docker-cli.

Signed-off-by: Vincent Batts <[email protected]>
@kolyshkin
Copy link
Copy Markdown
Contributor

kolyshkin commented Nov 15, 2024

In podman, the relevant CLI option is --security-opt=unmask=/sys/fs/cgroup, described here (look for `unmask).

If at all possible, can the option being added be made compatible with podman?

@tianon
Copy link
Copy Markdown
Member

tianon commented Nov 16, 2024

(I've replied to that same question over in #48828 (comment) instead where we're reviewing a explicit proposal adding an option 👀)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cgroup2 cgroup v2 area/security dco/no Automatically set by a bot when one of the commits lacks proper signature kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make /sys/fs/cgroup rw by default with cgroup2

7 participants