add KEP for cgroups v2 support #1370
Conversation
Welcome @giuseppe!
Hi @giuseppe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
AkihiroSuda
left a comment
Let's unshare cgroupns by default
Do you think we should open a separate KEP for rootless?

Yes, let's keep it separate from cgroups v2, which is already touching enough components :-)
derekwaynecarr
left a comment
we need to add more detail on kubelet changes where it interacts with cgroups directly
> ## User Stories
>
> * The Kubelet can run on a cgroups v2 host
Clarify that kubelet will support v1 or v2
> Kubelet PR: [https://github.com/kubernetes/kubernetes/pull/85218](https://github.com/kubernetes/kubernetes/pull/85218)
> ### Option 2: Use cgroups v2 throughout the stack
For option 1, kubelet still has to write cgroups v2 for its interactions with cgroup. Please add some detail around kubelet / cm package and eviction manager.
> ## Graduation Criteria
>
> - Alpha: Basic support for running Kubernetes on a cgroups v2 host.
Worth discussing whether we should add test coverage on Fedora as a good upstream candidate.
/assign
Force-pushed 83faa61 to c0c871d
> _net_prio_ are not available with the new version. The alternative to
> these controllers is to use eBPF.
>
> The lack of the network controllers probably affects some CNI
I don't currently know of any CNI plugins that use cgroups for network management. (My list is not exhaustive, of course...). Bandwidth limiting, if implemented, is done via a tc filter / qdisc for all of the plugins I'm aware of.
> With this approach cAdvisor would have to read back values from
> cgroups v2 files (already done).
So, with this option, CRI implementations will need to map the cgroup v1 fields to cgroup v2 values, right?
Could you list what each layer needs to do for each option to make it clear?
- user (pod spec)
- kubelet
- CRI implementation
- OCI runtime
the changes in the CRI implementation and OCI runtime might be implemented in a different way, deciding where to draw the line.
For CRI-O+crun, most of the logic is in the OCI runtime itself, but that is not the only way to achieve it.
For the Kubelet, the changes in this patch should be enough (at least as long as hugetlb is not available): kubernetes/kubernetes#85218
In the second phase, though, when cgroup v2 is fully supported throughout the stack, both the pod spec and the CRI will need to change.
can you clarify why you feel the pod spec needs to change? i see no major reason to change the pod spec or resource representation.
I was assuming we'd like to expose all/most of the cgroup v2 features.
e.g. the memory controller on cgroup v1 allows configuring:
- `memory.soft_limit_in_bytes`
- `memory.limit_in_bytes`

while on cgroup v2 we have:
- `memory.high`
- `memory.low`
- `memory.max`
- `memory.min`
but that is probably out of the scope as each new feature (if needed) must go through its own KEP?
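For illustration only, a hypothetical sketch (not code from the KEP or the kubelet) of how a transitional layer might map the two v1 memory knobs above onto their closest v2 interface files. The choice of `memory.high` for the soft limit is an assumption: v1's soft limit has no exact v2 equivalent, and `memory.high` is merely the nearest throttling knob.

```go
package main

import (
	"fmt"
	"strconv"
)

// v2MemoryFiles maps cgroup v1 memory settings to the closest cgroup v2
// interface files. Hypothetical helper for illustration only; a value of
// zero (or less) means "not set" and produces no v2 file entry.
func v2MemoryFiles(limitInBytes, softLimitInBytes int64) map[string]string {
	out := map[string]string{}
	if limitInBytes > 0 {
		// v1 memory.limit_in_bytes -> v2 memory.max (hard limit)
		out["memory.max"] = strconv.FormatInt(limitInBytes, 10)
	}
	if softLimitInBytes > 0 {
		// v1 memory.soft_limit_in_bytes has no exact v2 counterpart;
		// memory.high (throttle threshold) is the nearest knob.
		out["memory.high"] = strconv.FormatInt(softLimitInBytes, 10)
	}
	return out
}

func main() {
	fmt.Println(v2MemoryFiles(134217728, 67108864))
}
```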
@derekwaynecarr are future improvements based on what cgroup 2 offers out of scope for the current KEP?
@giuseppe future improvements for what cgroup v2 offers should be out of scope of this kep. i would keep this kep focused on kubelet is tolerant of cgroup v2 host. adding new cgroup v2 specific features to resource model would be a separate enhancement.
To address @yujuhong question, I think we are saying the following:
- no change required to pod spec
- kubelet cgroup manager uses v1 or v2 mode based on its introspection of sys/fs on startup
- kubelet qos and pod level cgroup creation on a v2 host uses the mapping table specified
- cri implementers that support cgroup v2 hosts must use a similar mapping to ensure that pod bounding cgroup values are consistent with the container cgroup values.
- oci runtime spec is not required to change as cri implementer provides mapping in the transitional period.
@giuseppe agree with above?
/cc @Random-Liu for containerd
@yujuhong: GitHub didn't allow me to request PR reviews from the following users: for, containerd. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed 23ddf34 to a87c972
The rootless patches have been merged into upstream Docker since early 2019 and no longer need to be incubated in this repo. Another reason to remove Docker from this repo is that dockershim might not support cgroup v2: kubernetes/enhancements#1370 (comment) Signed-off-by: Akihiro Suda <[email protected]>
/retest
Per Dawn's comment and the discussion on sig-node, we can merge this KEP and update it as we find issues with the implementation in 1.19. There is no disagreement on the need to iterate here. See: #1370 (comment)

/approve
Signed-off-by: Giuseppe Scrivano <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr, giuseppe, mrunalp. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing `/approve` in a comment.
awesome work!

LGTM :)
> | period | cpu.max | y = x | period and quota are written together |
> | quota | cpu.max | y = x | period and quota are written together |
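As a sketch of the mapping in the two rows above (an illustrative helper, not code from the KEP or the kubelet): v1 keeps quota and period in separate files, while v2 writes both into a single `cpu.max` line, with v1's `-1` (unlimited) spelled `max`:

```go
package main

import "fmt"

// cpuMax converts the cgroup v1 pair (cpu.cfs_quota_us, cpu.cfs_period_us)
// into the single value written to the cgroup v2 cpu.max file.
// Illustrative helper only.
func cpuMax(quotaUs, periodUs int64) string {
	if quotaUs < 0 {
		// -1 in v1 means "no limit"; v2 spells that "max"
		return fmt.Sprintf("max %d", periodUs)
	}
	return fmt.Sprintf("%d %d", quotaUs, periodUs)
}

func main() {
	fmt.Println(cpuMax(50000, 100000)) // 50000 100000
	fmt.Println(cpuMax(-1, 100000))    // max 100000
}
```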
> **blkio controller**
Any Kubernetes versions and plans to support the blkio controller?
The KEP is WIP: kubernetes/enhancements#1907
> new namespace. It was not enabled by default on a cgroup v1 system as
> older kernels lacked support for it.
>
> Privileged pods will still use the host cgroup namespace so to have
Let's make this customizable, especially for nested containers:
Signed-off-by: Giuseppe Scrivano <[email protected]>