kubelet/cm: refactor, prepare for runc 1.1 bump#108597

Merged
k8s-ci-robot merged 5 commits into kubernetes:master from kolyshkin:prepare-for-runc-1.1
Mar 23, 2022

Conversation

@kolyshkin (Contributor)

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This does some refactoring, simplification, and occasional fixes in pkg/kubelet/cm,
preparing the code for the runc 1.1 bump. Originally part of #107149.

Special notes for your reviewer:

Please review commit by commit. See individual commit messages for details.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

none

There's no need to call m.Update (which would create another instance of the
libcontainer cgroup manager, convert all the resources, and then set them).
All of this is already done here, except for Set().

Signed-off-by: Kir Kolyshkin <[email protected]>
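The change above can be sketched roughly as follows. The types and names here (Manager, libctManager, Resources) are simplified stand-ins for illustration, not the actual kubelet or runc/libcontainer APIs: Create converts the resources once and hands them straight to the cgroup manager's Set, instead of going through an Update that would build a second manager and redo the conversion.

```go
package main

import "fmt"

// Illustrative stand-in for libcontainer's resource struct.
type Resources struct{ PidsLimit int64 }

// Illustrative stand-in for a runc/libcontainer cgroup manager.
type libctManager struct{ applied *Resources }

// Set records the resources; in runc this would write the cgroup files.
func (l *libctManager) Set(r *Resources) error {
	l.applied = r
	return nil
}

type Manager struct{ libct libctManager }

// Create calls Set directly on the already-built manager, instead of
// delegating to an Update method that would construct a second
// libcontainer manager and convert all the resources again.
func (m *Manager) Create(r *Resources) error {
	return m.libct.Set(r)
}

func main() {
	m := &Manager{}
	_ = m.Create(&Resources{PidsLimit: 100})
	fmt.Println(m.libct.applied.PidsLimit)
}
```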
Commit ecd6361 added setting PidsLimit to Create and Update.

Commit bce9d5f added setting PidsLimit to m.toResources.

Now, PidsLimit is assigned twice.

Remove the duplicate.

Fixes: bce9d5f
Signed-off-by: Kir Kolyshkin <[email protected]>
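Illustratively, the fix leaves the resource-conversion helper as the single place where PidsLimit is assigned. The types below are simplified stand-ins, not the actual kubelet/libcontainer structs:

```go
package main

import "fmt"

// Stand-in for the kubelet-side config, where PidsLimit is optional.
type ResourceConfig struct{ PidsLimit *int64 }

// Stand-in for libcontainer's resource struct.
type Resources struct{ PidsLimit int64 }

// toResources is now the only place PidsLimit is converted; the
// duplicate assignment that Update used to do has been removed.
func toResources(rc *ResourceConfig) *Resources {
	r := &Resources{}
	if rc.PidsLimit != nil {
		r.PidsLimit = *rc.PidsLimit
	}
	return r
}

func main() {
	limit := int64(1024)
	fmt.Println(toResources(&ResourceConfig{PidsLimit: &limit}).PidsLimit)
}
```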
Commit 79be8be made hugetlb settings optional if cgroup v2 is used and
hugetlb is not available, fixing issue 92933. Note that at that time this was
only needed for v2, because for v1 the resources were set one by one, and only
for supported resources.

Commit d312ef7 switched the code to using Set from runc/libcontainer
cgroups manager, and expanded the check to cgroup v1 as well.

Move this check earlier, into m.toResources, so that instead of
converting all hugetlb resources from ResourceConfig to libcontainer's
Resources.HugetlbLimit and then setting it to nil, we can skip the
conversion entirely when hugetlb is not supported, avoiding work that
is not needed.

Signed-off-by: Kir Kolyshkin <[email protected]>
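A rough sketch of the change, with simplified stand-in types (the real code converts from the kubelet's ResourceConfig to libcontainer's Resources, and detects hugetlb support from the host):

```go
package main

import "fmt"

// Stand-in for libcontainer's per-page-size hugetlb limit.
type HugepageLimit struct {
	PageSize string
	Limit    uint64
}

type ResourceConfig struct{ HugePageLimit map[int64]int64 }
type Resources struct{ HugetlbLimit []*HugepageLimit }

// A tiny subset of page-size names, for illustration only.
var pageSizes = map[int64]string{2 << 20: "2MB", 1 << 30: "1GB"}

// toResources skips the hugetlb conversion entirely when the host has
// no hugetlb support, instead of converting everything and then
// discarding the result.
func toResources(rc *ResourceConfig, hugetlbSupported bool) *Resources {
	r := &Resources{}
	if !hugetlbSupported {
		return r // nothing to convert; the work is never done
	}
	for size, limit := range rc.HugePageLimit {
		r.HugetlbLimit = append(r.HugetlbLimit, &HugepageLimit{
			PageSize: pageSizes[size],
			Limit:    uint64(limit),
		})
	}
	return r
}

func main() {
	rc := &ResourceConfig{HugePageLimit: map[int64]int64{2 << 20: 4096}}
	fmt.Println(len(toResources(rc, false).HugetlbLimit))
	fmt.Println(len(toResources(rc, true).HugetlbLimit))
}
```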
Instead of doing (almost) the same thing from the three different
methods (Create, Update, Destroy), move the functionality to
libctCgroupConfig, replacing updateSystemdCgroupInfo.

The needResources bool exists because Destroy does not need resources,
so the resource conversion can be skipped there.

Signed-off-by: Kir Kolyshkin <[email protected]>
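A minimal sketch of the consolidation, using simplified stand-in types (the real helper builds a full libcontainer cgroup config from the kubelet's CgroupConfig):

```go
package main

import "fmt"

type Resources struct{ CpuShares uint64 }

// Stand-in for libcontainer's cgroup config.
type CgroupConfig struct {
	Name      string
	Resources *Resources
}

// libctCgroupConfig is the single helper now shared by Create, Update,
// and Destroy. needResources is false for Destroy, which only needs the
// cgroup identity, so the resource conversion is skipped there.
func libctCgroupConfig(name string, shares uint64, needResources bool) *CgroupConfig {
	c := &CgroupConfig{Name: name}
	if needResources {
		c.Resources = &Resources{CpuShares: shares}
	}
	return c
}

func main() {
	fmt.Println(libctCgroupConfig("/kubepods", 512, true).Resources != nil)
	fmt.Println(libctCgroupConfig("/kubepods", 512, false).Resources == nil)
}
```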
Move the rl == nil check to before we dereference it.

Signed-off-by: Kir Kolyshkin <[email protected]>
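A minimal sketch of this kind of fix, with illustrative names (the actual variable is the rl config being dereferenced in the kubelet code):

```go
package main

import "fmt"

type ResourceConfig struct{ Memory *int64 }

// memoryLimit checks rl == nil before any dereference, so a nil config
// returns a zero value instead of panicking. Names are illustrative.
func memoryLimit(rl *ResourceConfig) int64 {
	if rl == nil || rl.Memory == nil {
		return 0 // nil checked before dereferencing
	}
	return *rl.Memory
}

func main() {
	fmt.Println(memoryLimit(nil))
	m := int64(2048)
	fmt.Println(memoryLimit(&ResourceConfig{Memory: &m}))
}
```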
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 9, 2022
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and odinuge March 9, 2022 01:11
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 9, 2022
@kolyshkin kolyshkin changed the title pkg/kubelet/cm: refactor, prepare for runc 1.1 kubelet/cm: refactor, prepare for runc 1.1 bump Mar 9, 2022
@kolyshkin (Contributor, Author)

@odinuge @ehashman PTAL

@pacoxu (Member) commented Mar 11, 2022

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2022
@kolyshkin (Contributor, Author)

/assign @dchen1107

@ehashman (Member)

/assign @mrunalp @ehashman

@odinuge (Member) left a comment

Thanks for working on this @kolyshkin!

Looks like all my initial comments from the other PR are resolved here, so it looks good from my pov. I am a big fan of all the performance improvements here, as well as the cleanups.

Code looks good, but I would prefer that someone like @ehashman also take a quick look to validate this. Thanks

/lgtm

To be sure:
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager
/test pull-kubernetes-node-memoryqos-cgrpv2
/test pull-kubernetes-node-swap-fedora
/test pull-kubernetes-node-swap-fedora-serial
/test pull-kubernetes-node-swap-ubuntu-serial

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2022
@ehashman (Member) commented Mar 21, 2022

Looking at tests:

  • pull-kubernetes-node-kubelet-serial-cpu-manager
    • failed to build
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • pull-kubernetes-node-kubelet-serial-hugepages
    • failed to build
  • pull-kubernetes-node-kubelet-serial-memory-manager
    • failed to build
  • pull-kubernetes-node-kubelet-serial-topology-manager
    • failed to build
  • pull-kubernetes-node-memoryqos-cgrpv2
    • "The resource 'projects/k8s-infra-e2e-boskos-011/regions/us-west1/addresses/bootstrap-e2e-master-ip' was not found"
  • pull-kubernetes-node-swap-fedora
  • pull-kubernetes-node-swap-fedora-serial
  • pull-kubernetes-node-swap-ubuntu-serial
    • the only job with an actual test failure:
      E2eNode Suite: [sig-node] Container Manager Misc [Serial] Validate OOM score adjustments [NodeFeature:OOMScoreAdj] once the node is setup container runtime's oom-score-adj should be -999

      test/e2e_node/container_manager_test.go:81
      Timed out after 300.001s.
      Expected
          <*errors.errorString | 0xc000b28000>: {
              s: "expected pid 777's oom_score_adj to be -999; found -500",
          }
      to be nil
      test/e2e_node/container_manager_test.go:87

I know a lot of the presubmits aren't in great shape as they weren't a priority after the dockershim migration since they're only used occasionally. Trying to launch this many node tests simultaneously might have hit capacity issues. Let's see if we can focus on just a few critical ones for this PR.

/skip

@ehashman (Member)

Given that https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108597/pull-kubernetes-node-kubelet-serial-containerd/1505844761926307840/ passed I'm not super concerned by most of these failures, I think the jobs are just a bit neglected and we tried to run jobs we wouldn't normally.

I think the Ubuntu bug is a true failure. According to triage, we've only seen it on the runc 1.1 bump PRs. Is it deterministic?

/test pull-kubernetes-node-swap-ubuntu-serial

Failure cluster 5c6173278f01b3767cdf

Error text:
test/e2e_node/container_manager_test.go:81
Timed out after 300.004s.
Expected
    <*errors.errorString | 0xc0017ac050>: {
        s: "expected pid 777's oom_score_adj to be -999; found -500",
    }
to be nil
test/e2e_node/container_manager_test.go:87

Recent failures:

3/21/2022, 2:52:29 AM pr:pull-kubernetes-node-swap-ubuntu-serial
3/7/2022, 5:32:16 PM pr:pull-kubernetes-node-swap-ubuntu-serial

@bobbypage (Member) commented Mar 22, 2022

I'm a little confused on the pull-kubernetes-node-swap-ubuntu-serial failure. The test that is failing there is E2eNode Suite: [sig-node] Container Manager Misc [Serial] Validate OOM score adjustments [NodeFeature:OOMScoreAdj] once the node is setup container runtime's oom-score-adj should be -999. However, the same test is passing in the pull-kubernetes-node-kubelet-serial-containerd job (see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108597/pull-kubernetes-node-kubelet-serial-containerd/1505844761926307840/).

Maybe this is something specific to the setup of the ubuntu swap job? It's weird to see the test fail only on the swap job; I would expect it to either fail on both or neither :)
/cc @ehashman

@k8s-ci-robot k8s-ci-robot requested a review from ehashman March 22, 2022 19:48
@bobbypage (Member)

The swap job is failing because it does not set container-runtime-process-name, so the swap test incorrectly thinks the runtime is dockerd instead of containerd (dockerd has a -500 OOM score, while containerd has -999, which is what the test expects).

The serial job, in contrast, correctly sets container-runtime-process-name, so it doesn't have this problem. kubernetes/test-infra#25726 will fix the swap job.
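For context, the e2e test resolves the runtime's pid by process name and then checks /proc/<pid>/oom_score_adj. A minimal sketch of the parsing step of that check (the function name is mine, not the test's):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOOMScoreAdj parses the contents of /proc/<pid>/oom_score_adj,
// which is effectively what the e2e test reads once it has resolved the
// runtime's pid. If the job resolves the wrong process name (dockerd
// instead of containerd), it ends up seeing -500 instead of the
// expected -999. Illustrative helper, not the actual test code.
func parseOOMScoreAdj(data string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(data))
}

func main() {
	got, _ := parseOOMScoreAdj("-500\n")  // what the misconfigured job saw
	want, _ := parseOOMScoreAdj("-999\n") // what the test expects
	fmt.Println(got, want)
}
```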

@bobbypage (Member)

/test pull-kubernetes-node-swap-ubuntu-serial

@bobbypage (Member) commented Mar 22, 2022

/lgtm

LGTM on the code changes, thanks for all the refactoring, it's much cleaner now with less duplication of cgroup configuration.

The swap job is now passing as well after re-running with kubernetes/test-infra#25726 merged

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolyshkin, mrunalp, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2022
@ehashman (Member) left a comment

/lgtm

thanks for fixing the job!

@kolyshkin (Contributor, Author)

Some weird errors; they look like they're due to Go 1.18 being used by default. Let me try to rebase this.

@kolyshkin (Contributor, Author)

I also think that maybe golangci-lint needs to be updated to 1.45.0 because of Go 1.18.

Ah, indeed, there's a PR opened already: #108901 to disable the jobs. I think I might know how to fix the issue, so here's #108902

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

1 similar comment posted by @k8s-triage-robot.

@k8s-ci-robot (Contributor) commented Mar 23, 2022

@kolyshkin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

All of the following failed on commit de5a69d; none are required for merge. Each can be rerun with /test <job name>:

  • pull-kubernetes-node-kubelet-serial-memory-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-topology-manager
  • pull-kubernetes-node-kubelet-serial-hugepages
  • pull-kubernetes-node-memoryqos-cgrpv2
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • pull-kubernetes-node-swap-fedora
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • pull-kubernetes-node-swap-fedora-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot posted the same retest notice again.
