KEP-4033: Discover cgroup driver from CRI #4034
k8s-ci-robot merged 13 commits into kubernetes:master
Conversation
marquiz
commented
May 25, 2023
- One-line PR description: KEP for discovering kubelet cgroup driver from CRI
- Issue link: Discover cgroup driver from CRI #4033
- Other comments:
|
/cc haircommander |
|
/assign @BenTheElder |
|
/cc |
|
API proposal updated. It now has |
|
/cc |
|
thanks for taking a stab at this. for the record, this (driver mismatch) has been the most common error for low-level setups such as kubeadm, yet it's rather unclear to users. |
| This enhancement adds the ability for the container runtime to tell kubelet | ||
| which cgroup driver to use. This removes the need for specifying cgroup driver | ||
| in the kubelet configuration and eliminates the possibility of misaligned | ||
| cgroup driver configuration between the kubelet and the runtime. |
tell kubelet which cgroup driver to use
Is this an instruction, or a hint?
an instruction -> reworded to instructed
|
|
||
| ### Goals | ||
|
|
||
| - make kubelet automatically use the same cgroup driver as the container |
| - make kubelet automatically use the same cgroup driver as the container | |
| - allow kubelet to automatically use the same cgroup driver as the container |
?
(I think we'd support an override mechanism, but maybe I'm wrong)
No, we don't. There should be no reason to override, as the cgroup driver settings of the kubelet and the runtime really must be aligned (the same). Anything else is a misconfiguration and effectively a (partially or fully) "broken" node
agreed. it may make some sense to let a CR expose cgroup driver config to the user. but the kubelet cgroup driver CLI flag and v1beta1 config option must be removed eventually. it should be automatically detected.
to me, it makes sense to outline this deprecation work in the kep and then log an issue in k/k for sig node to track. but(!!) these cli/config options are so widely used that inevitably, and despite release notes, the removal will break some users.
it may make some sense to let a CR expose cgroup driver config to the user
they could potentially do this by having a CR that configures the CRI implementation, but I think the larger intention of this enhancement is to take the configuration of cgroup driver away from kubelet+kubernetes in general
Proposal updated: now the "override" mechanism would be to disable the feature gate. @sftim this resolved?
I still recommend the wording change, unless there's a reason it makes the KEP less useful
Hmm, might be hair-splitting but I feel that "allow" changes the aim of this proposal. The goal is for kubelet to obey what runtime tells it to do (when the feature is enabled), not to opt out for something else. Maybe it's my non-native language skills... WDYT @haircommander @sftim ?
| marked as deprecated, to be dropped when the runtimes the Kubelet is supported | ||
| to run with all support the flag. |
AIUI we do not publish a list of supported vs. unsupported container runtimes.
(We do tell people what we test against, but I don't think we - Kubernetes - make a public claim about compatibility with some runtimes and not others).
True. Though in practice we only have a few implementations. We could drop the deprecation part of the proposal, as the flag is still valid and necessary with runtime versions that don't support this. Any thoughts @haircommander?
We could also be explicit about what we mean by all. Members of the cri-o and containerd communities have signed off on the proposal. I don't know whether cri-dockerd is aware of it. We could change the wording to
to be dropped when support for the field is adopted by CRI-O and containerd.
I now dropped the sentence about deprecating --cgroup-driver. I think dropping the flag will require wide/ubiquitous adoption of "fresh" versions of the container runtimes that support this feature. Simpler to not speculate on that in this KEP. WDYT @haircommander @sftim ?
tbh I think we should deprecate so users have a warning that they should remove it from their scripts and the like. I don't think we need to be explicit about when we will drop it.
The --cgroup-driver option has been deprecated and will be dropped in a future release. Please upgrade to a CRI implementation that supports cgroup-driver detection.
or something
OK, added back with the following wording:
Further, the kubeletConfig field and `--cgroup-driver` flag will be
marked as deprecated, to be dropped when support for the feature is adopted by
CRI-O and containerd. Usage of the deprecated setting will produce a log
message, e.g.:
"cgroupDriver option has been deprecated and will be dropped in a future release. Please upgrade to a CRI implementation that supports cgroup-driver detection."
| Recall that end users cannot usually observe component logs or access metrics. | ||
| --> | ||
|
|
||
| Likely no metrics will expose this. |
How about exposing a cgroup driver time series? It'd be low cardinality I think.
In practice, this setting is not expected to change. Metrics for configuration settings deserve a separate proposal, if desired, I think. @haircommander
If we're loading it from third party software, then surely it's not a configuration setting?
(imagine if you update to a new system image for nodes with a different container runtime, and now you have a mix of cgroup v2 and the future cgroup v3 - you might care)
yeah I guess I also find myself wondering how useful such a metric would be. cgroup version (1, 2, or future 3) is not covered in the scope of this KEP--it's automatically discovered by both kubelet and cri. I don't see any reason a cluster admin would want a metric to see the cgroup driver.
How about displaying it in Node Status?
Would be possible, of course, but I'd like to keep Kubernetes API changes out of the scope of this KEP if possible.
|
|
||
| # The following PRR answers are required at alpha release | ||
| # List the feature gate name and the components for which it must be enabled | ||
| feature-gates: [] |
As well as changing the CRI protocol, I'd expect to see a feature gate (even one that's on by default) to control whether the kubelet falls back to manual configuration or honors what the container runtime reports.
If available the cgroup driver information received from the container runtime will take precedence over cgroupDriver setting from the kubelet config (or
--cgroup-driver command line flag).
It doesn't sound like there's any other way to opt out. In that case we might want to mark the feature as beta from the off.
I really can't think of a scenario where a manual override would be desired. If the settings are not aligned, the node is misconfigured and not working properly. This is exactly the (common'ish) misconfiguration scenario that we're trying to prevent with the proposal. You're right with the straight-to-beta 🤔 Thoughts @haircommander?
personally, I don't think a feature gate is necessary. I think the kubelet should be robust enough to fall back to its own cgroup-driver for a couple of releases while broad adoption of newer CRIs takes place. Eventually, it can drop support for its own cgroup-driver. we shouldn't drop the flag until we're confident the runtimes are present.
It doesn't sound like there's any other way to opt out
I don't think we should provide a way to opt-out tbh. the runtime already needs to be configured manually for a cgroup driver, so the cluster admin already needs to know that cgroup-driver should be configured. The only difference is the cluster admin now has one place fewer to do that configuration.
feature flags are not meant to be used as opt-in or opt-out flags; they are meant to graduate and eventually be removed. They are used to give the project the opportunity to roll back changes across multiple releases. If something unexpected happens with this feature and we introduce the change directly, we'll be releasing a core component with a bug that we can only fix with a new release.
As Tim said, when we are confident with the change we sometimes start with a feature gate enabled by default in beta, but since this introduces a behavior change, it must have a feature gate and provide graduation criteria (for example: it works for containerd and CRI-O, and no bugs are reported during one release) IMHO
Another thing a feature gate can help with: if we need to change implementation details around how we merge or override, we are protected and can stay in alpha to do that.
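To make the gate mechanics concrete, here is a minimal sketch of how the kubelet could consult such a gate, assuming it is named KubeletCgroupDriverFromCRI (the name this feature ended up using in Kubernetes); the helper itself is illustrative:

```go
package kubelet

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// useCRICgroupDriver is illustrative: with the gate disabled the kubelet
// keeps today's behavior and never asks the runtime, which is the only
// "opt-out" contemplated in this thread.
func useCRICgroupDriver() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.KubeletCgroupDriverFromCRI)
}
```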
|
|
||
| The responsibility of managing the Linux cgroups is currently split between the | ||
| kubelet and the container runtime. Kubelet takes care of the pod (sandbox) | ||
| level cgroups whereas the runtime is responsible for per-container cgroups. |
Probably silly question. Why don't we want to delegate sandbox cgroups management to a runtime? If this is doable we don't need to synchronise this between Kubelet and runtime anymore.
Is it because we also want to control the namespaces that are associated with things in the sandbox cgroup?
I don't think it's a silly question. I think we'll get back to that in some other KEP :) That's just a much bigger topic and "problem" to tackle. And it would take time to adopt even if we had the code merged in the runtimes' mainline now.
Is it because we also want to control the namespaces that are associated with things in the sandbox cgroup?
I think it's for historical reasons (docker) but others probably know better than me.
| logs or events for this purpose. | ||
| --> | ||
|
|
||
| Kubelet and container runtime version. |
If I know these version values, and I don't have a detailed understanding of the behavior of my preferred container runtime, can I work out whether or not the kubelet is automatically learning which cgroup driver to use?
If not, I'd like us to add that. It's much easier for us to document “look at this metrics query” or “look for this log line” vs. “go and read the docs and / or source code for some third party component”.
We should add this for the first release where the new behavior will be on by default, and optionally for any earlier releases.
This question is not the same as can I work out which driver is in use; it's about working out whether or not the kubelet is overriding my configuration. (It is also useful to know which driver is in use).
would having the output of crictl info display the cgroupDriver be sufficient? it currently reports the runtime status info and would likely be extended to include it, which would provide a runtime agnostic way of polling
I added a mention of crictl info for determining the runtime-side support. WDYT @sftim
| // List of current observed runtime conditions. | ||
| repeated RuntimeCondition conditions = 1; | ||
| + // Configuration settings of the runtime | ||
| + RuntimeConfiguration configuration = 2; |
I think we should split this into a separate CRI call.
@mrunalp I think in this case we should add a streaming interface for the runtime to be able to notify about dynamic changes (not for the cgroup driver but for other, potentially more dynamic stuff in the future). Adding yet another endpoint that gets periodically polled doesn't sound reasonable to me. WDYT? @haircommander ?
I think the agreement was for this KEP to be simple and for this use case. We could add a streaming interface later if needed for other use cases?
yeah I think something worth noting is we're not prepared in this KEP to react to the runtime changing its mind. I don't think the kubelet should update the cgroup driver if the cri changes without a kubelet restart. from that perspective, separating the calls may be appropriate
I don't think the kubelet should update the cgroup driver if the cri changes
I agree. But I was thinking about this from the CRI API pov. So we would add the simple "unary" rpc for now and then later, with the dynamic stuff, add a parallel streaming rpc (for basically the same data, I think). Is this the way to go?
yeah I find myself leaning towards having a separate API call for the streaming stuff, partly to reduce the info going over the wire, partly to make clear how the kubelet will react to the data
OK, changed the proposal to add a new RuntimeConfig rpc for querying the runtime configuration. @mrunalp @haircommander WDYT?
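As a sketch of the kubelet side, querying the new rpc through the generated CRI Go bindings might look roughly like this; message and field names mirror the proposed shape and should be read as illustrative:

```go
package kubelet

import (
	"context"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// queryRuntimeCgroupDriver issues the proposed RuntimeConfig rpc once at
// kubelet startup. An error (e.g. Unimplemented from an older runtime)
// tells the caller to fall back to the kubelet's own configuration.
func queryRuntimeCgroupDriver(ctx context.Context, client runtimeapi.RuntimeServiceClient) (runtimeapi.CgroupDriver, error) {
	resp, err := client.RuntimeConfig(ctx, &runtimeapi.RuntimeConfigRequest{})
	if err != nil {
		return 0, err // caller must check the error before using the value
	}
	// Linux holds the "global" runtime configuration, decoupled from
	// individual runtime handlers (see the discussion below).
	return resp.GetLinux().GetCgroupDriver(), nil
}
```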
| // List of current observed runtime conditions. | ||
| repeated RuntimeCondition conditions = 1; | ||
| + // Configuration settings of the runtime | ||
| + RuntimeConfiguration configuration = 2; |
This should be a map that takes runtime handler names as the keys:
https://github.com/kubernetes/enhancements/pull/3858/files#diff-057b8627f24bc6a0742b51b4fea113938a3440eafe255edecc834f3301c008fcR471
Regarding the runtime configuration we'd probably need a "system-wide config" that is independent of the runtime class and the runtime-handler-specific stuff. I think the cgroup driver falls in the first category – even though at least in containerd it is technically possible to have a different cgroup driver for different runtime handlers, I can't think of a practical scenario where this would make sense (and the kubelet can anyway handle only one)
Also, I think we've tried to prefer lists with named fields over maps in the CRI API.
Thoughts @mrunalp @haircommander @mikebrow ?
kubelet can anyways handle only one
👍
Then this does not need to be a map, but I still expect the proto file to have a comment line to state that this is a global config that is decoupled from individual runtime handlers.
I added a comment along these lines in the proposal (in the .proto diff)
|
Updated the KEP, addressing review comments. However, this comment is still unaddressed, waiting for feedback on that one |
| If available the cgroup driver information received from the container runtime | ||
| will take precedence over cgroupDriver setting from the kubelet config (or | ||
| `--cgroup-driver` command line flag). If the runtime does not provide | ||
| information about the cgroup driver, then kubelet will fall back to using its | ||
| own configuration (`cgroupDriver` from kubeletConfig or the `--cgroup-driver` | ||
| flag). | ||
|
|
||
| Kubelet startup is modified so that connection to the CRI server (container | ||
| runtime) is established and RuntimeStatus is queried before initializing the | ||
| kubelet internal container-manager which is responsible for kubelet-side cgroup | ||
| management. |
This is an important change of behavior: if I have a kubelet with a hardcoded cgroup driver in version X and I upgrade to version X+1, I may have my kubelet running with a different configuration without noticing?
I'd argue that in this case X+1 would actually fix your node. Logically there should be only one configuration point for this setting. Having it in both the kubelet and the container runtime is a constant source of problems: if your runtime is configured with driver Y and the kubelet with Z, then your node is broken
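To illustrate the reordered startup quoted above, here is a rough sketch with hypothetical names; the only point is that the CRI is queried before the kubelet's cgroup management is initialized, with the kubelet config as the fallback:

```go
package kubelet

// runtimeService is a narrowed, hypothetical view of the CRI client used
// only for this sketch; the kubelet's real runtime interface is larger.
type runtimeService interface {
	CgroupDriver() (string, error)
}

// initCgroupManagement illustrates the modified startup sequence: query
// the runtime first, then initialize the container manager. An older
// runtime that cannot report its driver triggers the fallback to the
// kubelet's own cgroupDriver / --cgroup-driver setting.
func initCgroupManagement(rt runtimeService, configuredDriver string, initContainerManager func(driver string) error) error {
	driver, err := rt.CgroupDriver()
	if err != nil || driver == "" {
		driver = configuredDriver
	}
	return initContainerManager(driver)
}
```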
|
@johnbelamaric thanks for the review. We'll think about what to do here. I'm confident that this is such a simple change that any problems in kubelet (which can only show up at kubelet startup) could (and should) be treated and fixed as a "normal" bug. |
|
@johnbelamaric I've updated the enhancement to target to alpha. Thanks for your feedback, PTAL |
- reword tell -> instruct
- fix SIG-Node -> SIG Node
- fix typos
- drop mention of deprecating kubelet --cgroup-driver flag
- mention "crictl info" as a way to determine if the feature is supported by the runtime
- add a comment in the CRI API proto that the config here is for "global" configuration options
- changed CRI API to have a new RPC for querying for runtime config (instead of re-using runtime Status)
- update kep.yaml: add reviewers and approvers
- added feature gate (enabled by default)
- changed target maturity to beta
- add back the deprecation warning about the cgroupDriver kubelet config setting
- small updates to PRR
Co-authored-by: Peter Hunt~ <[email protected]>
In response to feedback from johnbelamaric.
Signed-off-by: Peter Hunt <[email protected]>
|
Thanks for the updates. /approve |
| logs or events for this purpose. | ||
| --> | ||
|
|
||
| Kubelet and container runtime version. The |
non-blocking comment: since there is a feature gate, the operator can see which nodes it is enabled on via a metric. Please note this in a later KEP update.
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, marquiz, mrunalp

The full list of commands accepted by this bot can be found here. The pull request process is described here. |