add KEP for cgroups v2 support #1370
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,241 @@ | ||
| --- | ||
| title: cgroups v2 | ||
| authors: | ||
| - "@giuseppe" | ||
| owning-sig: sig-node | ||
| participating-sigs: | ||
| - sig-architecture | ||
| reviewers: | ||
| - "@yujuhong" | ||
| - "@dchen1107" | ||
| - "@derekwaynecarr" | ||
| approvers: | ||
| - "@yujuhong" | ||
| - "@dchen1107" | ||
| - "@derekwaynecarr" | ||
| editor: Giuseppe Scrivano | ||
| creation-date: 2019-11-18 | ||
| last-updated: 2019-11-18 | ||
| status: implementable | ||
| see-also: | ||
| replaces: | ||
| superseded-by: | ||
| --- | ||
|
|
||
| # Cgroups v2 | ||
|
|
||
| ## Table of Contents | ||
|
|
||
| <!-- toc --> | ||
| - [Summary](#summary) | ||
| - [Motivation](#motivation) | ||
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [User Stories](#user-stories) | ||
| - [Implementation Details](#implementation-details) | ||
| - [Proposal](#proposal) | ||
| - [Dependencies on OCI and container runtimes](#dependencies-on-oci-and-container-runtimes) | ||
| - [Current status of dependencies](#current-status-of-dependencies) | ||
| - [Current cgroups usage and the equivalent in cgroups v2](#current-cgroups-usage-and-the-equivalent-in-cgroups-v2) | ||
| - [cgroup namespace](#cgroup-namespace) | ||
| - [Phase 1: Convert from cgroups v1 settings to v2](#phase-1-convert-from-cgroups-v1-settings-to-v2) | ||
| - [Phase 2: Use cgroups v2 throughout the stack](#phase-2-use-cgroups-v2-throughout-the-stack) | ||
| - [Risk and Mitigations](#risk-and-mitigations) | ||
| - [Graduation Criteria](#graduation-criteria) | ||
| <!-- /toc --> | ||
|
|
||
| ## Summary | ||
|
|
||
| A proposal to add support for cgroups v2 to Kubernetes. | ||
|
|
||
| ## Motivation | ||
|
|
||
| The new kernel cgroups v2 API was declared stable more than two years | ||
| ago. Newer kernel features such as PSI depend on cgroups v2, and | ||
| cgroups v1 will eventually become obsolete in favor of cgroups v2. | ||
| Some distros already use cgroups v2 by default, which prevents | ||
| Kubernetes from running on them because it currently requires | ||
| cgroups v1. | ||
|
|
||
| ## Goals | ||
|
|
||
|
|
||
| This proposal aims to: | ||
|
|
||
| * Add support for cgroups v2 to the Kubelet | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| * Expose new cgroup2-only features | ||
| * Dockershim | ||
| * Plugins support | ||
|
|
||
| ## User Stories | ||
|
|
||
|
|
||
| * The Kubelet can run on a host using either cgroups v1 or v2. | ||
| * Feature parity between cgroups v1 and cgroups v2. | ||
|
|
||
| ## Implementation Details | ||
|
|
||
| ## Proposal | ||
|
|
||
| The proposal is to implement cgroups v2 in two different phases. | ||
|
|
||
| The first phase ensures that any configuration file designed for | ||
| cgroups v1 will continue to work on cgroups v2. | ||
|
|
||
| The second phase requires changes through the entire stack, including | ||
| the OCI runtime specifications. | ||
|
|
||
| At startup the Kubelet detects what hierarchy the system is using. It | ||
| checks the file system type for `/sys/fs/cgroup` (the equivalent of | ||
| `stat -f --format '%T' /sys/fs/cgroup`). If the type is `cgroup2fs` | ||
| then the Kubelet will use only cgroups v2 during all its execution. | ||
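The detection step can be illustrated with a minimal sketch. It does not call `statfs` as the text describes; as a stand-in it reads the filesystem type from `/proc/mounts`-style text, where the type appears as `cgroup2` (while `stat -f` prints `cgroup2fs`). The function names and the sample mount lines are hypothetical, and this is not the Kubelet's actual code.

```python
def cgroup_fs_type(mounts_text, mount_point="/sys/fs/cgroup"):
    """Return the filesystem type mounted at mount_point, or None if absent."""
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point:
            fs_type = fields[2]  # a later mount shadows an earlier one
    return fs_type


def uses_cgroup_v2(mounts_text):
    """True when /sys/fs/cgroup itself is a cgroup2 mount (unified hierarchy)."""
    return cgroup_fs_type(mounts_text) == "cgroup2"


# Illustrative mount tables (assumed, not captured from a real host).
V1_SAMPLE = (
    "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0\n"
    "cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,memory 0 0\n"
)
V2_SAMPLE = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0\n"
```

On a Linux host the same functions could be fed the contents of `/proc/mounts`; the check is done once at startup, so the result can be cached for the lifetime of the process.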
|
|
||
| The current proposal does not aim to deprecate cgroups v1, which must | ||
| still be supported throughout the stack. | ||
|
|
||
| Device plugins that require v2 enablement are out of the scope for | ||
| this proposal. | ||
|
|
||
| ### Dependencies on OCI and container runtimes | ||
|
|
||
| In order to support features only available in cgroups v2 but not in | ||
| cgroups v1, the OCI runtime specs must be changed. | ||
|
|
||
| New features that are not present in cgroup v1 are out of the scope | ||
| for this proposal. | ||
|
|
||
| The dockershim implementation embedded in the Kubelet won't be | ||
| supported on cgroup v2. | ||
|
|
||
| ### Current status of dependencies | ||
|
|
||
| - CRI-O+crun: support cgroups v2 | ||
|
|
||
| - runc support for cgroups v2 is work in progress [current status](#current-cgroups-usage-and-the-equivalent-in-cgroups-v2) | ||
|
|
||
| - containerd: [https://github.com/containerd/containerd/issues/3726](https://github.com/containerd/containerd/issues/3726) | ||
|
|
||
| - Moby: [https://github.com/moby/moby/pull/40174](https://github.com/moby/moby/pull/40174) | ||
|
|
||
| - OCI runtime spec: TODO | ||
|
|
||
| - cAdvisor already supports cgroups v2 ([https://github.com/google/cadvisor/pull/2309](https://github.com/google/cadvisor/pull/2309)) | ||
|
|
||
| ## Current cgroups usage and the equivalent in cgroups v2 | ||
|
|
||
| |Kubernetes cgroups v1|Kubernetes cgroups v2 behavior| | ||
| |---|---| | ||
| |CPU stats for Horizontal Pod Autoscaling|No .percpu cpuacct stats.| | ||
| |CPU pinning based on integral cores|Cpuset controller available| | ||
| |Memory limits|Not changed, different naming| | ||
|
|
||
| |PIDs limits|Not changed, same naming| | ||
| |hugetlb|Added to linux-next, targeting Linux 5.6| | ||
|
|
||
| ### cgroup namespace | ||
|
|
||
| A cgroup namespace restricts the view on the cgroups. When | ||
|
|
||
| unshare(CLONE_NEWCGROUP) is done, the current cgroup the process | ||
| resides in becomes the root. Other cgroups won't be visible from the | ||
| new namespace. It was not enabled by default on cgroups v1 systems | ||
| because older kernels lacked support for it. | ||
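As a rough illustration of what "becomes the root" means, the sketch below parses the cgroups v2 entry of `/proc/<pid>/cgroup`, which has the form `0::<path>`. Inside a new cgroup namespace the path is reported relative to the namespace root. The function name and both sample strings (including the pod path) are hypothetical.

```python
def unified_cgroup_path(proc_cgroup_text):
    """Return the cgroups v2 path from /proc/<pid>/cgroup content, or None."""
    for line in proc_cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        # The v2 (unified) entry has hierarchy ID 0 and an empty controller list.
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


# Assumed examples: the same process seen from the host cgroup namespace
# and from inside its own cgroup namespace.
HOST_VIEW = "0::/kubepods/burstable/pod-example/container-example\n"
NAMESPACED_VIEW = "0::/\n"
```

With the host view, the full path is visible; with the namespaced view, the process only sees `/`.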
|
|
||
|
|
||
| Privileged pods will still use the host cgroup namespace in order to have | ||
|
Member
This seems inconsistent with other namespaces. On cgroup v2 systems, the host cgroup namespace should be enabled only if
Member
Let's make this customizable, especially for nested containers: |
||
| visibility on all the other cgroups. | ||
|
|
||
| ### Phase 1: Convert from cgroups v1 settings to v2 | ||
|
|
||
|
|
||
| We can convert the values passed by Kubernetes from cgroups v1 form to | ||
|
|
||
| cgroups v2 so Kubernetes users don’t have to change what they specify | ||
| in their manifests. | ||
|
|
||
| crun has implemented the conversion as follows: | ||
|
|
||
| **Memory controller** | ||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | limit | memory.max | y = x || | ||
| | swap | memory.swap.max | y = x || | ||
| | reservation | memory.low | y = x || | ||
|
|
||
| **PIDs controller** | ||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | limit | pids.max | y = x || | ||
|
|
||
| **CPU controller** | ||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | shares | cpu.weight | y = (1 + ((x - 2) * 9999) / 262142) | convert from [2-262144] to [1-10000]| | ||
| | period | cpu.max | y = x| period and quota are written together| | ||
| | quota | cpu.max | y = x| period and quota are written together| | ||
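The CPU conversions in the table can be sketched as follows. The sketch assumes integer (truncating) division for the `cpu.weight` formula and assumes that a missing or negative quota means "no limit", written as `max` in `cpu.max`. The function names are hypothetical and this is not crun's code.

```python
def cpu_shares_to_weight(shares):
    """Map cgroups v1 cpu.shares [2-262144] to cgroups v2 cpu.weight [1-10000]."""
    if not 2 <= shares <= 262144:
        raise ValueError("cpu.shares must be in [2, 262144]")
    # Formula from the table above; integer division is an assumption.
    return 1 + ((shares - 2) * 9999) // 262142


def format_cpu_max(quota, period):
    """Build the single line written to cpu.max: '<quota> <period>'."""
    # Quota and period are written together; an unset or negative quota
    # is treated here as unlimited and rendered as "max".
    quota_field = "max" if quota is None or quota < 0 else str(quota)
    return f"{quota_field} {period}"
```

For example, the common default of 1024 shares maps to a weight of 39, and a quota of 50000 with a period of 100000 is written as `50000 100000`.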
|
|
||
| **blkio controller** | ||
|
Any Kubernetes versions and plans to support the blkio controller?
kubernetes/enhancements#1907 the KEP is WIP. |
||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | weight | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000]| | ||
| | weight_device | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000]| | ||
| |rbps|io.max|y=x|| | ||
| |wbps|io.max|y=x|| | ||
| |riops|io.max|y=x|| | ||
| |wiops|io.max|y=x|| | ||
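A small sketch of the blkio conversions above, assuming integer division for the weight formula and the `MAJ:MIN key=value ...` line format that the kernel expects in `io.max`. The function names are hypothetical and this is not crun's code.

```python
def blkio_weight_to_bfq_weight(weight):
    """Map cgroups v1 blkio weight [10-1000] linearly to [1-10000]."""
    if not 10 <= weight <= 1000:
        raise ValueError("blkio weight must be in [10, 1000]")
    # Formula from the table above; integer division is an assumption.
    return 1 + (weight - 10) * 9999 // 990


def format_io_max(major, minor, rbps=None, wbps=None, riops=None, wiops=None):
    """Build one io.max line for a device; unset limits are left out."""
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    fields = [f"{key}={value}" for key, value in limits.items() if value is not None]
    return " ".join([f"{major}:{minor}"] + fields)
```

The throttle values are copied unchanged (`y = x`); only the file layout differs, since all four limits for a device share one line in `io.max`.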
|
|
||
| **cpuset controller** | ||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | cpus | cpuset.cpus | y = x || | ||
| | mems | cpuset.mems | y = x || | ||
|
|
||
| **hugetlb controller** | ||
|
|
||
| | OCI (x) | cgroup 2 value (y) | conversion | comment | | ||
| |---|---|---|---| | ||
| | <PAGE_SIZE>.limit_in_bytes | hugetlb.<PAGE_SIZE>.max | y = x || | ||
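For hugetlb the value is unchanged and only the file name differs. A minimal sketch, assuming limits are given as a mapping from the kernel's page-size label (for example `2MB` or `1GB`) to a limit in bytes; the function name is hypothetical.

```python
def convert_hugetlb_limits(v1_limits):
    """Map {page size label: limit in bytes} to {cgroups v2 file name: value}."""
    # v1 file: hugetlb.<PAGE_SIZE>.limit_in_bytes
    # v2 file: hugetlb.<PAGE_SIZE>.max (same value, y = x)
    return {
        f"hugetlb.{page_size}.max": str(limit)
        for page_size, limit in v1_limits.items()
    }
```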
|
|
||
| With this approach cAdvisor would have to read back values from | ||
| cgroups v2 files (already done). | ||
|
Contributor
So, with this option, CRI implementations will need to map the cgroup v1 fields to cgroup v2 values, right? Could you list what each layer needs to do for each option to make it clear?
Member
Author
the changes in the CRI implementation and OCI runtime might be implemented in a different way, deciding where to draw the line. In the second phase though, when cgroup v2 is fully supported through the stack there is need to change both pod specs+CRI.
Member
can you clarify why you feel the pod spec needs to change? i see no major reason to change the pod spec or resource representation.
Member
Author
I was assuming we'd like to expose all/most of the cgroup v2 features. e.g. the memory controller on cgroup v1 allows to configure:
while on cgroup v2 we have:
but that is probably out of the scope as each new feature (if needed) must go through its own KEP?
Member
Author
@derekwaynecarr are future improvements based on what cgroup 2 offers out of scope for the current KEP?
Member
@giuseppe future improvements for what cgroup v2 offers should be out of scope of this kep. i would keep this kep focused on kubelet is tolerant of cgroup v2 host. adding new cgroup v2 specific features to resource model would be a separate enhancement. To address @yujuhong question, I think we are saying the following:
@giuseppe agree with above? |
||
|
|
||
| Kubelet PR: [https://github.com/kubernetes/kubernetes/pull/85218](https://github.com/kubernetes/kubernetes/pull/85218) | ||
|
|
||
| ### Phase 2: Use cgroups v2 throughout the stack | ||
|
|
||
| This option means that the values are written directly to cgroups v2 | ||
| by the runtime. The Kubelet doesn’t do any conversion when setting | ||
| these values over the CRI. We will need to add a cgroups v2 specific | ||
|
|
||
| LinuxContainerResources to the CRI. | ||
|
|
||
| This depends on container runtimes such as runc and crun being able | ||
| to write cgroups v2 values directly. | ||
|
|
||
| OCI will need support for cgroups v2 and CRI implementations will | ||
| write to the cgroups v2 section of the new OCI runtime config.json. | ||
|
|
||
| ## Risk and Mitigations | ||
|
|
||
| Some cgroups v1 features are not available with cgroups v2: | ||
|
|
||
| - _cpuacct.usage_percpu_ | ||
| - network stats from cgroup | ||
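To illustrate the first point: on cgroups v2 the CPU usage of a cgroup is exposed through the flat-keyed `cpu.stat` file, which reports aggregate values such as `usage_usec` but no per-CPU breakdown. The parser below is a minimal sketch; the function name and the sample values are hypothetical.

```python
def parse_cpu_stat(text):
    """Parse a flat-keyed cgroups v2 file such as cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        value = value.strip()
        if key and value.isdigit():
            stats[key] = int(value)
    return stats


# Assumed sample content; real files may carry additional keys.
CPU_STAT_SAMPLE = "usage_usec 2500000\nuser_usec 2000000\nsystem_usec 500000\n"
```

A consumer that previously relied on `cpuacct.usage_percpu` would only find the total in `usage_usec` here.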
|
|
||
|
|
||
| Some cgroups v1 controllers, such as _devices_, _net_cls_, and | ||
| _net_prio_, are not available with the new version. The alternative to | ||
| these controllers is to use eBPF. | ||
|
|
||
| ## Graduation Criteria | ||
|
|
||
|
|
||
| - Alpha: Phase 1 completed and basic support for running Kubernetes on | ||
| a cgroups v2 host, with e2e test coverage or a plan for the | ||
| failing tests. | ||
| A good candidate for running cgroups v2 tests is Fedora 31, which | ||
| already defaults to cgroups v2. | ||
|
|
||
| - Beta: e2e test coverage and performance testing. | ||
|
|
||
| - GA: Assuming no negative user feedback based on production | ||
| experience, promote after 2 releases in beta. | ||
| *TBD* whether phase 2 must be implemented for GA. | ||