KEP-5714: Allow specifying whether to unshare cgroup namespaces #5715
AkihiroSuda wants to merge 1 commit into kubernetes:master
Conversation
AkihiroSuda commented Dec 3, 2025
- One-line PR description: Allow specifying whether to unshare cgroup namespaces
- Issue link: Allow specifying whether to unshare cgroup namespaces #5714
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AkihiroSuda

Needs approval from an approver in each of these files:
The motivation is to allow privileged pods to unshare cgroup namespaces.
TBH this feature seems like something that could be toggled on the CRI impl side. I don't know if the use case is broad enough to warrant a new pod spec field
NRI plugin could do it too
> TBH this feature seems like something that could be toggled on the CRI impl side.
Disagree. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/pod-security-admission/policy/check_hostNamespaces.go expects the namespaces to be fully controlled in the pod spec.
> I don't know if the use case is broad enough to warrant a new pod spec field
I'd rather say every privileged pod should use this field to unshare the cgroup namespace, unless it really has to use the host cgroup namespace.
This applies to use cases with Podman/Buildah as well. https://www.redhat.com/en/blog/podman-inside-kubernetes
> expects the namespaces to be fully controlled in the pod spec.
Some of them; we never specify a UTS namespace because it's always pod level. That's a contract the kubelet asks the CRI to uphold, but there isn't much maintaining it outside of critest. My broader point is that there are plenty of cases where the CRI lies to the kubelet that it's doing something when really the CRI is fully following the user's intent (NRI does this, registry mirroring does this). This case could be covered by that as well, as there's nothing in the kubelet verifying the namespace is actually pod level.
> I'd rather say every privileged pod should use this field to unshare the cgroup namespace, unless it really has to use the host cgroup namespace.
yeah I get the point here, but personally I don't think we (as a kubernetes project) should try that hard to secure privileged pods. it's supposed to be a really heavy hammer. customizations can exist through NRI if needed, but the pod spec has a high barrier of entry and I'm not sure there is ecosystem integration that requires it be exposed on pod level right now.
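For context, CRI expresses namespace requests through a `NamespaceMode` enum (POD, CONTAINER, NODE, TARGET). A sketch of how a kubelet-side decision could map a hypothetical pod-level knob onto the cgroup namespace mode; the defaulting rule shown (privileged implies NODE unless the knob opts out) is an assumption modeled on today's runtime behavior on cgroup v2, not code from the KEP:

```go
package main

import "fmt"

// NamespaceMode mirrors the names of the CRI enum of the same name
// (represented here as strings for readability).
type NamespaceMode string

const (
	ModePod  NamespaceMode = "POD"
	ModeNode NamespaceMode = "NODE"
)

// cgroupNamespaceMode sketches how a kubelet could pick the cgroup
// namespace mode for a container. hostCgroupNS is the hypothetical pod
// field; nil means "unset", preserving the current default where
// privileged containers share the host (NODE) cgroup namespace.
func cgroupNamespaceMode(privileged bool, hostCgroupNS *bool) NamespaceMode {
	if hostCgroupNS != nil {
		if *hostCgroupNS {
			return ModeNode
		}
		return ModePod // explicit opt-out: unshare even when privileged
	}
	if privileged {
		return ModeNode // today's default for privileged on cgroup v2
	}
	return ModePod
}

func main() {
	f := false
	fmt.Println(cgroupNamespaceMode(true, nil), cgroupNamespaceMode(true, &f))
}
```

An NRI plugin could apply the same mapping without a spec field, which is the crux of the disagreement above: the behavior is implementable either way, but only the spec field makes it declarable and auditable.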
> some of them, we never specify uts namespace because it's always pod level.
The UTS namespace does not seem comparable to the cgroup namespace here, as UTSNS does not really affect anything but the hostname and the domainname, while cgroupNS affects the actual cgroup hierarchy.
### User Stories (Optional)
#### Story 1: BuildKit
See <https://github.com/moby/buildkit/pull/6368>:
> When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace
> (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP),
> the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was
> created for the buildkitd container, leading to incorrect resource accounting and enforcement
> which in turn can cause OOM errors and CPU contention on the node.

> yeah I get the point here, but personally I don't think we (as a kubernetes project) should try that hard to secure privileged pods.
I didn't say we should try to secure privileged pods.
The purpose of unsharing the cgroup namespace is just to keep the /sys/fs/cgroup hierarchy consistent with normal pods.
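The consistency point can be illustrated by what `/proc/self/cgroup` reports: in a private cgroup namespace the container's own cgroup appears as the root (`/`), while in the host namespace the full node-level path leaks through. A platform-independent sketch of that relative-path computation; the kubepods-style path below is a made-up example:

```go
package main

import (
	"fmt"
	"strings"
)

// visibleCgroupPath computes what a process would see in /proc/self/cgroup
// given its absolute cgroup path on the host and the cgroup that serves as
// its namespace root ("/" means the host namespace, i.e. no unsharing).
func visibleCgroupPath(hostPath, nsRoot string) string {
	if nsRoot == "/" {
		return hostPath // host cgroupns: the full node hierarchy is visible
	}
	if hostPath == nsRoot {
		return "/" // private cgroupns: the container's own cgroup is the root
	}
	// A descendant cgroup is shown relative to the namespace root.
	return "/" + strings.TrimPrefix(strings.TrimPrefix(hostPath, nsRoot), "/")
}

func main() {
	// Made-up kubepods-style path for illustration.
	p := "/kubepods/burstable/pod1234/ctr5678"
	fmt.Println(visibleCgroupPath(p, "/")) // what a privileged pod sees today
	fmt.Println(visibleCgroupPath(p, p))   // what an unshared pod would see
}
```

This is why software like BuildKit that writes to `/sys/fs/cgroup` relative to the root it observes ends up creating cgroups outside its own subtree when run in the host cgroup namespace.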
I'm open to what other maintainers think but I'm still personally not convinced the use cases are worth the API surface
I'm just chiming in as a user (and the author of the buildkit PR that @AkihiroSuda referenced).
It was surprising to me that a privileged pod shares the cgroupns of the host by default on a cgroup v2 system. I would expect that to be an opt-in setting much like hostNetwork and hostPID, since it has to do with visibility (and the assumption of isolation even when privileged) rather than the actual capabilities of the pod processes. The privileged flag is indeed a huge hammer when it comes to capabilities, but IMHO that should not come at the cost of isolation by default. Isolation via a cgroupns has real utility independent of the security context. In this case, it would prevent the resource accounting and enforcement of all processes on the node from being interfered with (unintentionally) by a single pod process (see the referenced PR).
Looking at past discussions around the cgroup v2 implementation, it seems like it was mainly for cgroup v1 backwards compatibility that the cgroup v2 implementation did not adopt an unshared cgroupns by default. Those concerns seem totally reasonable to me as a user. However, the utility of cgroupns isolation, which only became possible with cgroup v2, was never questioned in that discussion. Rather, it appears to have been acknowledged explicitly.
Given the decision to not unshare cgroupns by default but the acknowledgement of its utility with regards to isolation, it seems reasonable that the API would support it.
@kubernetes/sig-node-leads Could you take a look?
Currently, privileged pods inevitably use the host cgroup namespace, but this is relatively fragile with nested containers and does not work with runc v1.4.0:
We, the runc maintainers, will find a workaround to fix this regression, but for the long term it would be nice to allow specifying whether to use the host cgroup namespace or not.
Signed-off-by: Akihiro Suda <[email protected]>
7ec2861 to 870222e
@AkihiroSuda: The following tests failed.