KEP-4033: Discover cgroup driver from CRI #4034
k8s-ci-robot merged 13 commits into kubernetes:master
Conversation
marquiz
commented
May 25, 2023
- One-line PR description: KEP for discovering kubelet cgroup driver from CRI
- Issue link: Discover cgroup driver from CRI #4033
- Other comments:
|
/cc haircommander |
|
/assign @BenTheElder |
|
/cc |
|
API proposal updated. It now has |
|
/cc |
|
thanks for taking a stab at this. for the record, this (driver mismatch) has been the most common error for low-level setups such as kubeadm, yet it's rather unclear to users. |
| This enhancement adds the ability for the container runtime to tell kubelet | ||
| which cgroup driver to use. This removes the need for specifying cgroup driver | ||
| in the kubelet configuration and eliminates the possibility of misaligned | ||
| cgroup driver configuration between the kubelet and the runtime. |
tell kubelet which cgroup driver to use
Is this an instruction, or a hint?
an instruction -> reworded to instructed
|
|
||
| ### Goals | ||
|
|
||
| - make kubelet automatically use the same cgroup driver as the container |
| - make kubelet automatically use the same cgroup driver as the container | |
| - allow kubelet to automatically use the same cgroup driver as the container |
?
(I think we'd support an override mechanism, but maybe I'm wrong)
No, we don't. There should be no reason to override, as the cgroup driver settings of the kubelet and the runtime really must be aligned (the same). Anything else is a misconfiguration and effectively a (partially or fully) "broken" node
agreed. it may make some sense to let a CR expose cgroup driver config to the user. but the kubelet cgroup driver CLI flag and v1beta1 config option must be removed eventually. it should be automatically detected.
to me, it makes sense to outline this deprecation work in the kep and then log an issue in k/k for sig node to track. but(!!) these cli/config options are so widely used that inevitably, and despite release notes, the removal will break some users.
it may make some sense to let a CR expose cgroup driver config to the user
they could potentially do this by having a CR that configures the CRI implementation, but I think the larger intention of this enhancement is to take the configuration of cgroup driver away from kubelet+kubernetes in general
Proposal updated: now the "override" mechanism would be to disable the feature gate. @sftim this resolved?
I still recommend the wording change, unless there's a reason it makes the KEP less useful
Hmm, might be hair-splitting but I feel that "allow" changes the aim of this proposal. The goal is for kubelet to obey what runtime tells it to do (when the feature is enabled), not to opt out for something else. Maybe it's my non-native language skills... WDYT @haircommander @sftim ?
| marked as deprecated, to be dropped when the runtimes the Kubelet is supported | ||
| to run with all support the flag. |
AIUI we do not publish a list of supported vs. unsupported container runtimes.
(We do tell people what we test against, but I don't think we - Kubernetes - make a public claim about compatibility with some runtimes and not others).
True. Though in practice we only have a few implementations. We could drop the deprecation part of the proposal, as the flag is still valid and necessary with runtime versions that don't support this. Any thoughts @haircommander?
We could also be explicit about what we mean by all. Members of the cri-o and containerd communities have signed off on the proposal. I don't know whether cri-dockerd is aware of it. We could change the wording to
to be dropped when support for the field is adopted by CRI-O and containerd.
I now dropped the sentence about deprecating --cgroup-driver. I think dropping the flag will require wide/ubiquitous adoption of "fresh" versions of the container runtimes that support this feature. Simpler to not speculate on that in this KEP. WDYT @haircommander @sftim ?
tbh I think we should deprecate so users have a warning that they should remove it from their scripts and the like. I don't think we need to be explicit about when we will drop it.
The --cgroup-driver option has been deprecated and will be dropped in a future release. Please upgrade to a CRI implementation that supports cgroup-driver detection.
or something
OK, added back with the following wording:
Further, the kubeletConfig field and `--cgroup-driver` flag will be
marked as deprecated, to be dropped when support for the feature is adopted by
CRI-O and containerd. Usage of the deprecated setting will produce a log
message, e.g.:
"cgroupDriver option has been deprecated and will be dropped in a future release. Please upgrade to a CRI implementation that supports cgroup-driver detection."
| Recall that end users cannot usually observe component logs or access metrics. | ||
| --> | ||
|
|
||
| Likely no metrics will expose this. |
How about exposing a cgroup driver time series? It'd be low cardinality I think.
In practice, this setting is not expected to change. Metrics for configuration settings deserve a separate proposal, if desired, I think. @haircommander
If we're loading it from third party software, then surely it's not a configuration setting?
(imagine if you update to a new system image for nodes with a different container runtime, and now you have a mix of cgroup v2 and the future cgroup v3 - you might care)
yeah I guess I also find myself wondering how useful such a metric would be. cgroup version (1, 2, or future 3) is not covered in the scope of this KEP--it's automatically discovered by both kubelet and cri. I don't see any reason a cluster admin would want a metric to see the cgroup driver.
How about displaying it in Node Status?
Would be possible, of course, but I'd like to keep Kubernetes API changes out of the scope of this KEP if possible.
|
|
||
| # The following PRR answers are required at alpha release | ||
| # List the feature gate name and the components for which it must be enabled | ||
| feature-gates: [] |
As well as changing the CRI protocol, I'd expect to see a feature gate (even one that's on by default) to control whether the kubelet falls back to manual configuration or honors what the container runtime reports.
If available the cgroup driver information received from the container runtime will take precedence over cgroupDriver setting from the kubelet config (or
--cgroup-driver command line flag).
It doesn't sound like there's any other way to opt out. In that case we might want to mark the feature as beta from the off.
I really can't think of a scenario where a manual override would be desired. If the settings are not aligned, the node is misconfigured and not working properly. This is exactly the (common'ish) misconfiguration scenario that we're trying to prevent with the proposal. You're right with the straight-to-beta 🤔 Thoughts @haircommander?
personally, I don't think a feature gate is necessary. I think the kubelet should be robust enough to fall back to its own cgroup-driver for a couple of releases while broad adoption of newer CRIs takes place. Eventually, it can drop support for its own cgroup-driver. we shouldn't drop the flag until we're confident the runtimes are present.
It doesn't sound like there's any other way to opt out
I don't think we should provide a way to opt-out tbh. the runtime already needs to be configured manually for a cgroup driver, so the cluster admin already needs to know that cgroup-driver should be configured. The only difference is the cluster admin now has one place fewer to do that configuration.
feature flags are not meant to be used as opt-in or opt-out flags; they are meant to graduate and eventually be removed. They are used to give the project the opportunity to roll back changes across multiple releases. If something unexpected happens with this feature and we introduce the change directly, we'll be releasing a core component with a bug that we can only fix with a new release.
As Tim said, when we are confident with the change we sometimes start with a feature gate enabled by default in beta, but since this introduces a behavior change, it must have a feature gate and provide graduation criteria (for example: it works for containerd and CRI-O, and no bugs are reported during one release) IMHO
Another thing a feature gate can help with: if we need to change implementation details around how we merge or override, we are protected and can stay in alpha to do that.
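To make the gate mechanics concrete, here is a minimal sketch of how the kubelet could consult such a gate, assuming it is named KubeletCgroupDriverFromCRI (the name this feature ended up using in Kubernetes); the helper itself is illustrative:

```go
package kubelet

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// useCRICgroupDriver is illustrative: with the gate disabled the kubelet
// keeps today's behavior and never asks the runtime, which is the only
// "opt-out" contemplated in this thread.
func useCRICgroupDriver() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.KubeletCgroupDriverFromCRI)
}
```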
|
|
||
| The responsibility of managing the Linux cgroups is currently split between the | ||
| kubelet and the container runtime. Kubelet takes care of the pod (sandbox) | ||
| level cgroups whereas the runtime is responsible for per-container cgroups. |
Probably silly question. Why don't we want to delegate sandbox cgroups management to a runtime? If this is doable we don't need to synchronise this between Kubelet and runtime anymore.
Is it because we also want to control the namespaces that are associated with things in the sandbox cgroup?
I don't think it's a silly question. I think we'll get back to that in some other KEP :) That's just a much bigger topic and "problem" to tackle. And it would take time to adopt even if we had the code merged in the runtimes' mainline now.
Is it because we also want to control the namespaces that are associated with things in the sandbox cgroup?
I think it's for historical reasons (docker) but others probably know better than me.
| logs or events for this purpose. | ||
| --> | ||
|
|
||
| Kubelet and container runtime version. |
If I know these version values, and I don't have a detailed understanding of the behavior of my preferred container runtime, can I work out whether or not the kubelet is automatically learning which cgroup driver to use?
If not, I'd like us to add that. It's much easier for us to document “look at this metrics query” or “look for this log line” vs. “go and read the docs and / or source code for some third party component”.
We should add this for the first release where the new behavior will be on by default, and optionally for any earlier releases.
This question is not the same as can I work out which driver is in use; it's about working out whether or not the kubelet is overriding my configuration. (It is also useful to know which driver is in use).
would having the output of crictl info display the cgroupDriver be sufficient? it currently reports the runtime status info and would likely be extended to include it, which would provide a runtime agnostic way of polling
I added a mention of crictl info for determining the runtime-side support. WDYT @sftim
| // List of current observed runtime conditions. | ||
| repeated RuntimeCondition conditions = 1; | ||
| + // Configuration settings of the runtime | ||
| + RuntimeConfiguration configuration = 2; |
I think we should split this into a separate CRI call.
@mrunalp I think in this case we should add a streaming interface for the runtime to be able to notify about dynamic changes (not for the cgroup driver but for other, potentially more dynamic stuff in the future). Adding yet another endpoint that gets periodically polled doesn't sound reasonable to me. WDYT? @haircommander ?
I think the agreement was for this KEP to be simple and for this use case. We could add a streaming interface later if needed for other use cases?
yeah I think something worth noting is we're not prepared in this KEP to react to the runtime changing its mind. I don't think the kubelet should update the cgroup driver if the cri changes without a kubelet restart. from that perspective, separating the calls may be appropriate
I don't think the kubelet should update the cgroup driver if the cri changes
I agree. But I was thinking about this from the CRI API pov. So we would add the simple "unary" rpc for now and then later, with the dynamic stuff, add a parallel streaming rpc (for basically the same data, I think). Is this the way to go?
yeah I find myself leaning towards having a separate API call for the streaming stuff, partly to reduce the info going over the wire, partly to make clear how the kubelet will react to the data
OK, changed the proposal to add a new RuntimeConfig rpc for querying the runtime configuration. @mrunalp @haircommander WDYT?
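As a sketch of the kubelet side, querying the new rpc through the generated CRI Go bindings might look roughly like this; message and field names mirror the proposed shape and should be read as illustrative:

```go
package kubelet

import (
	"context"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// queryRuntimeCgroupDriver issues the proposed RuntimeConfig rpc once at
// kubelet startup. An error (e.g. Unimplemented from an older runtime)
// tells the caller to fall back to the kubelet's own configuration.
func queryRuntimeCgroupDriver(ctx context.Context, client runtimeapi.RuntimeServiceClient) (runtimeapi.CgroupDriver, error) {
	resp, err := client.RuntimeConfig(ctx, &runtimeapi.RuntimeConfigRequest{})
	if err != nil {
		return 0, err // caller must check the error before using the value
	}
	// Linux holds the "global" runtime configuration, decoupled from
	// individual runtime handlers (see the discussion below).
	return resp.GetLinux().GetCgroupDriver(), nil
}
```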
| // List of current observed runtime conditions. | ||
| repeated RuntimeCondition conditions = 1; | ||
| + // Configuration settings of the runtime | ||
| + RuntimeConfiguration configuration = 2; |
This should be a map that takes runtime handler names as the keys:
https://github.com/kubernetes/enhancements/pull/3858/files#diff-057b8627f24bc6a0742b51b4fea113938a3440eafe255edecc834f3301c008fcR471
Regarding the runtime configuration we'd probably need a "system-wide config" that is independent of the runtime class and the runtime-handler-specific stuff. I think the cgroup driver falls in the first category – even though at least in containerd it is technically possible to have a different cgroup driver for different runtime handlers, I can't think of a practical scenario where this would make sense (and the kubelet can anyway handle only one)
Also, I think we've tried to prefer lists with named fields over maps in the CRI API.
Thoughts @mrunalp @haircommander @mikebrow ?
kubelet can anyways handle only one
👍
Then this does not need to be a map, but I still expect the proto file to have a comment line to state that this is a global config that is decoupled from individual runtime handlers.
I added a comment along these lines in the proposal (in the .proto diff)
|
Updated the KEP, addressing review comments. However, this comment is still unaddressed, waiting for feedback on that one |
| If available the cgroup driver information received from the container runtime | ||
| will take precedence over cgroupDriver setting from the kubelet config (or | ||
| `--cgroup-driver` command line flag). If the runtime does not provide | ||
| information about the cgroup driver, then kubelet will fall back to using its | ||
| own configuration (`cgroupDriver` from kubeletConfig or the `--cgroup-driver` | ||
| flag). | ||
|
|
||
| Kubelet startup is modified so that connection to the CRI server (container | ||
| runtime) is established and RuntimeStatus is queried before initializing the | ||
| kubelet internal container-manager which is responsible for kubelet-side cgroup | ||
| management. |
This is an important change of behavior: if I have a kubelet with a hardcoded cgroup driver in version X and I upgrade to version X+1, I may have my kubelet running with a different configuration without noticing?
I'd argue that in this case X+1 would actually fix your node. Logically there should be only one configuration point for this setting. Having it in both the kubelet and the container runtime is a constant source of problems: if your runtime is configured with driver Y and the kubelet with Z, then your node is broken
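To illustrate the reordered startup quoted above, here is a rough sketch with hypothetical names; the only point is that the CRI is queried before the kubelet's cgroup management is initialized, with the kubelet config as the fallback:

```go
package kubelet

// runtimeService is a narrowed, hypothetical view of the CRI client used
// only for this sketch; the kubelet's real runtime interface is larger.
type runtimeService interface {
	CgroupDriver() (string, error)
}

// initCgroupManagement illustrates the modified startup sequence: query
// the runtime first, then initialize the container manager. An older
// runtime that cannot report its driver triggers the fallback to the
// kubelet's own cgroupDriver / --cgroup-driver setting.
func initCgroupManagement(rt runtimeService, configuredDriver string, initContainerManager func(driver string) error) error {
	driver, err := rt.CgroupDriver()
	if err != nil || driver == "" {
		driver = configuredDriver
	}
	return initContainerManager(driver)
}
```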
|
@johnbelamaric thanks for the review. We'll think about what to do here. I'm confident that this is such a simple change that any problems in kubelet (which can only show up at kubelet startup) could (and should) be treated and fixed as a "normal" bug. |
|
@johnbelamaric I've updated the enhancement to target to alpha. Thanks for your feedback, PTAL |
- reword tell -> instruct
- fix SIG-Node -> SIG Node
- fix typos
- drop mention of deprecating kubelet --cgroup-driver flag
- mention "crictl info" as a way to determine if the feature is supported by the runtime
- add a comment in the CRI API proto that the config here is for "global" configuration options
- changed CRI API to have a new RPC for querying for runtime config (instead of re-using runtime Status)
- update kep.yaml: add reviewers and approvers
- added feature gate (enabled by default)
- changed target maturity to beta
- add back the deprecation warning about the cgroupDriver kubelet config setting
- small updates to PRR
Co-authored-by: Peter Hunt~ <[email protected]>
In response to feedback from johnbelamaric.
Signed-off-by: Peter Hunt <[email protected]>
|
Thanks for the updates. /approve |
| logs or events for this purpose. | ||
| --> | ||
|
|
||
| Kubelet and container runtime version. The |
non-blocking comment: since there is a feature gate, the operator can see which nodes it is enabled on via a metric. Please note this in a later KEP update.
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, marquiz, mrunalp

The full list of commands accepted by this bot can be found here. The pull request process is described here. |