
KEP-5714: Allow specifying whether to unshare cgroup namespaces#5715

Open
AkihiroSuda wants to merge 1 commit intokubernetes:masterfrom
AkihiroSuda:cgroupns

Conversation

@AkihiroSuda (Member):

  • One-line PR description: Allow specifying whether to unshare cgroup namespaces
  • Other comments:

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 3, 2025
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AkihiroSuda
Once this PR has been reviewed and has the lgtm label, please assign deads2k, mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


The motivation is to allow privileged pods to unshare cgroup namespaces.
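As background for readers: on cgroup v2, whether a process shares the host cgroup namespace is visible from `/proc/self/cgroup`. Its single `0::<path>` line shows `/` when the process sits at the root of its own (unshared) cgroup namespace, and a host-relative path such as `/kubepods/...` when the host namespace is shared. A minimal Go sketch, assuming cgroup v2 (the helper name is mine, not from the KEP):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseCgroupV2Path extracts the namespace-relative cgroup path from the
// single "0::<path>" line that /proc/self/cgroup contains on cgroup v2.
// (cgroup v1 has multiple lines and is not handled here.)
func parseCgroupV2Path(contents string) string {
	return strings.TrimPrefix(strings.TrimSpace(contents), "0::")
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "not on Linux or /proc unavailable:", err)
		return
	}
	if path := parseCgroupV2Path(string(data)); path == "/" {
		fmt.Println("at the root of its own cgroup namespace (unshared)")
	} else {
		fmt.Println("cgroup path relative to namespace root:", path)
	}
}
```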
Contributor:

TBH this feature seems like something that could be toggled on the CRI impl side. I don't know if the use case is broad enough to warrant a new pod spec field

Contributor:

NRI plugin could do it too

Member Author (@AkihiroSuda):

> TBH this feature seems like something that could be toggled on the CRI impl side.

Disagree. https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/pod-security-admission/policy/check_hostNamespaces.go expects the namespaces to be fully controlled in the pod spec.

> I don't know if the use case is broad enough to warrant a new pod spec field

I'd rather say every privileged pod should use this field to unshare the cgroup namespace, unless it really has to use the host cgroup namespace.
This applies to use cases with Podman/Buildah as well. https://www.redhat.com/en/blog/podman-inside-kubernetes

Contributor:

> expects the namespaces to be fully controlled in the pod spec.

Some of them. We never specify the UTS namespace because it's always pod-level; that's a contract the kubelet asks the CRI to uphold, but there isn't much enforcement outside of critest. My broader point is that there are plenty of cases where the CRI tells the kubelet it's doing one thing while really it is following the user's intent (NRI does this, registry mirroring does this). This case could be covered that way as well, since there's nothing in the kubelet verifying the namespace is actually pod-level.

> I'd rather say every privileged pod should use this field to unshare the cgroup namespace, unless it really has to use the host cgroup namespace.

Yeah, I get the point here, but personally I don't think we (as a Kubernetes project) should try that hard to secure privileged pods; it's supposed to be a really heavy hammer. Customizations can exist through NRI if needed, but the pod spec has a high barrier to entry, and I'm not sure there is any ecosystem integration right now that requires this be exposed at the pod level.

Member Author (@AkihiroSuda):

> some of them, we never specify uts namespace because it's always pod level.

The UTS namespace does not seem comparable to the cgroup namespace here: the UTS namespace affects nothing but the hostname and domainname, while the cgroup namespace affects the actual cgroup hierarchy.

### User Stories (Optional)

#### Story 1: BuildKit

See <https://github.com/moby/buildkit/pull/6368>:

> When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace
> (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP),
> the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was
> created for the buildkitd container, leading to incorrect resource accounting and enforcement
> which in turn can cause OOM errors and CPU contention on the node.

> yeah I get the point here, but personally I don't think we (as a kubernetes project) should try that hard to secure privileged pods.

I didn't say we should try to secure privileged pods.
The purpose of unsharing the cgroup namespace is just to keep the /sys/fs/cgroup hierarchy consistent with normal pods.
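To make the consistency point concrete: tooling inside a container typically locates its own cgroup subtree by joining the path from `/proc/self/cgroup` onto `/sys/fs/cgroup`. A sketch under the assumption of cgroup v2 (the helper name is mine):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// ownCgroupDir maps the "0::<path>" line from /proc/self/cgroup to the
// directory under /sys/fs/cgroup holding this process's controllers.
// With an unshared cgroup namespace the path is "/", so the result is
// simply /sys/fs/cgroup, exactly as in a normal pod; with the host
// namespace it is a deep kubepods/... path that nested tooling has to
// resolve correctly (and that may not even be mounted inside the pod).
func ownCgroupDir(procSelfCgroup string) string {
	rel := strings.TrimPrefix(strings.TrimSpace(procSelfCgroup), "0::")
	return filepath.Join("/sys/fs/cgroup", rel)
}

func main() {
	fmt.Println(ownCgroupDir("0::/\n"))                       // unshared cgroup namespace
	fmt.Println(ownCgroupDir("0::/kubepods/podX/containerY")) // host cgroup namespace
}
```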

Contributor:

I'm open to what other maintainers think but I'm still personally not convinced the use cases are worth the API surface

@marxarelli (Jan 21, 2026):

I'm just chiming in as a user (and the author of the buildkit PR that @AkihiroSuda referenced).

It was surprising to me that a privileged pod shares the cgroupns of the host by default on a cgroup v2 system. I would expect that to be an opt-in setting much like hostNetwork and hostPID, since it has to do with visibility (and the assumption of isolation even when privileged) rather than the actual capabilities of the pod processes. The privileged flag is indeed a huge hammer when it comes to capabilities, but IMHO that should not come at the cost of isolation by default. Isolation via a cgroupns has real utility independent of the security context. In this case, it would prevent the resource accounting and enforcement of all processes on the node from being interfered with (unintentionally) by a single pod process (see the referenced PR).

Looking at past discussions around the cgroup v2 implementation, it seems it was mainly for cgroup v1 backwards compatibility that the cgroup v2 implementation did not adopt an unshared cgroupns by default. Those concerns seem totally reasonable to me as a user. However, the utility of cgroupns isolation, which only became possible with cgroup v2, was never questioned in that discussion; rather, it appears to have been acknowledged explicitly.

Given the decision to not unshare cgroupns by default but the acknowledgement of its utility with regards to isolation, it seems reasonable that the API would support it.

Member Author (@AkihiroSuda, Feb 5, 2026):

@kubernetes/sig-node-leads Could you take a look?

Currently, privileged pods inevitably have to use the host cgroup namespace, but this is relatively fragile with nested containers and does not work with runc v1.4.0:

> We the runc maintainers will find a workaround to fix this regression, but for the long term it would be nice to allow specifying whether to use the host cgroup namespace or not.

@k8s-ci-robot (Contributor):

@AkihiroSuda: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-test | 870222e | link | true | `/test pull-enhancements-test` |
| pull-enhancements-verify | 870222e | link | true | `/test pull-enhancements-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
