kubelet/cm: refactor, prepare for runc 1.1 bump#108597

Merged
k8s-ci-robot merged 5 commits into kubernetes:master from kolyshkin:prepare-for-runc-1.1
Mar 23, 2022

Conversation

@kolyshkin (Contributor)

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This does some refactoring, simplification, and occasional fixes in pkg/kubelet/cm,
preparing the code for the runc 1.1 bump. Originally part of #107149.

Special notes for your reviewer:

Please review commit by commit. See individual commit messages for details.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

none

There's no need to call m.Update (which would create another instance of the
libcontainer cgroup manager, convert all the resources, and then set them).
All of this is already done here, except for Set().

Signed-off-by: Kir Kolyshkin <[email protected]>
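The change above can be sketched roughly as follows. The types and names here (Manager, libctManager, Resources) are simplified stand-ins for illustration, not the actual kubelet or runc/libcontainer APIs: Create converts the resources once and hands them straight to the cgroup manager's Set, instead of going through an Update that would build a second manager and redo the conversion.

```go
package main

import "fmt"

// Illustrative stand-in for libcontainer's resource struct.
type Resources struct{ PidsLimit int64 }

// Illustrative stand-in for a runc/libcontainer cgroup manager.
type libctManager struct{ applied *Resources }

// Set records the resources; in runc this would write the cgroup files.
func (l *libctManager) Set(r *Resources) error {
	l.applied = r
	return nil
}

type Manager struct{ libct libctManager }

// Create calls Set directly on the already-built manager, instead of
// delegating to an Update method that would construct a second
// libcontainer manager and convert all the resources again.
func (m *Manager) Create(r *Resources) error {
	return m.libct.Set(r)
}

func main() {
	m := &Manager{}
	_ = m.Create(&Resources{PidsLimit: 100})
	fmt.Println(m.libct.applied.PidsLimit)
}
```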
Commit ecd6361 added setting PidsLimit to Create and Update.

Commit bce9d5f added setting PidsLimit to m.toResources.

Now, PidsLimit is assigned twice.

Remove the duplicate.

Fixes: bce9d5f
Signed-off-by: Kir Kolyshkin <[email protected]>
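Illustratively, the fix leaves the resource-conversion helper as the single place where PidsLimit is assigned. The types below are simplified stand-ins, not the actual kubelet/libcontainer structs:

```go
package main

import "fmt"

// Stand-in for the kubelet-side config, where PidsLimit is optional.
type ResourceConfig struct{ PidsLimit *int64 }

// Stand-in for libcontainer's resource struct.
type Resources struct{ PidsLimit int64 }

// toResources is now the only place PidsLimit is converted; the
// duplicate assignment that Update used to do has been removed.
func toResources(rc *ResourceConfig) *Resources {
	r := &Resources{}
	if rc.PidsLimit != nil {
		r.PidsLimit = *rc.PidsLimit
	}
	return r
}

func main() {
	limit := int64(1024)
	fmt.Println(toResources(&ResourceConfig{PidsLimit: &limit}).PidsLimit)
}
```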
Commit 79be8be made hugetlb settings optional if cgroup v2 is used and
hugetlb is not available, fixing issue 92933. Note that at that time this was
only needed for v2, because for v1 the resources were set one by one, and only
for supported resources.

Commit d312ef7 switched the code to using Set from runc/libcontainer
cgroups manager, and expanded the check to cgroup v1 as well.

Move this check earlier, into m.toResources, so that instead of
converting all hugetlb resources from ResourceConfig to libcontainer's
Resources.HugetlbLimit and then setting it to nil, we can skip the
conversion entirely when hugetlb is not supported, avoiding work that
is not needed.

Signed-off-by: Kir Kolyshkin <[email protected]>
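A rough sketch of the change, with simplified stand-in types (the real code converts from the kubelet's ResourceConfig to libcontainer's Resources, and detects hugetlb support from the host):

```go
package main

import "fmt"

// Stand-in for libcontainer's per-page-size hugetlb limit.
type HugepageLimit struct {
	PageSize string
	Limit    uint64
}

type ResourceConfig struct{ HugePageLimit map[int64]int64 }
type Resources struct{ HugetlbLimit []*HugepageLimit }

// A tiny subset of page-size names, for illustration only.
var pageSizes = map[int64]string{2 << 20: "2MB", 1 << 30: "1GB"}

// toResources skips the hugetlb conversion entirely when the host has
// no hugetlb support, instead of converting everything and then
// discarding the result.
func toResources(rc *ResourceConfig, hugetlbSupported bool) *Resources {
	r := &Resources{}
	if !hugetlbSupported {
		return r // nothing to convert; the work is never done
	}
	for size, limit := range rc.HugePageLimit {
		r.HugetlbLimit = append(r.HugetlbLimit, &HugepageLimit{
			PageSize: pageSizes[size],
			Limit:    uint64(limit),
		})
	}
	return r
}

func main() {
	rc := &ResourceConfig{HugePageLimit: map[int64]int64{2 << 20: 4096}}
	fmt.Println(len(toResources(rc, false).HugetlbLimit))
	fmt.Println(len(toResources(rc, true).HugetlbLimit))
}
```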
Instead of doing (almost) the same thing from the three different
methods (Create, Update, Destroy), move the functionality to
libctCgroupConfig, replacing updateSystemdCgroupInfo.

The needResources bool exists because Destroy does not need resources,
so the resource conversion can be skipped there.

Signed-off-by: Kir Kolyshkin <[email protected]>
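A minimal sketch of the consolidation, using simplified stand-in types (the real helper builds a full libcontainer cgroup config from the kubelet's CgroupConfig):

```go
package main

import "fmt"

type Resources struct{ CpuShares uint64 }

// Stand-in for libcontainer's cgroup config.
type CgroupConfig struct {
	Name      string
	Resources *Resources
}

// libctCgroupConfig is the single helper now shared by Create, Update,
// and Destroy. needResources is false for Destroy, which only needs the
// cgroup identity, so the resource conversion is skipped there.
func libctCgroupConfig(name string, shares uint64, needResources bool) *CgroupConfig {
	c := &CgroupConfig{Name: name}
	if needResources {
		c.Resources = &Resources{CpuShares: shares}
	}
	return c
}

func main() {
	fmt.Println(libctCgroupConfig("/kubepods", 512, true).Resources != nil)
	fmt.Println(libctCgroupConfig("/kubepods", 512, false).Resources == nil)
}
```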
Move the rl == nil check to before we dereference it.

Signed-off-by: Kir Kolyshkin <[email protected]>
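A minimal sketch of this kind of fix, with illustrative names (the actual variable is the rl config being dereferenced in the kubelet code):

```go
package main

import "fmt"

type ResourceConfig struct{ Memory *int64 }

// memoryLimit checks rl == nil before any dereference, so a nil config
// returns a zero value instead of panicking. Names are illustrative.
func memoryLimit(rl *ResourceConfig) int64 {
	if rl == nil || rl.Memory == nil {
		return 0 // nil checked before dereferencing
	}
	return *rl.Memory
}

func main() {
	fmt.Println(memoryLimit(nil))
	m := int64(2048)
	fmt.Println(memoryLimit(&ResourceConfig{Memory: &m}))
}
```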
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 9, 2022
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and odinuge March 9, 2022 01:11
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 9, 2022
@kolyshkin kolyshkin changed the title pkg/kubelet/cm: refactor, prepare for runc 1.1 kubelet/cm: refactor, prepare for runc 1.1 bump Mar 9, 2022
@kolyshkin (Contributor, Author)

@odinuge @ehashman PTAL

@pacoxu (Member) commented Mar 11, 2022

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 11, 2022
@kolyshkin (Contributor, Author)

/assign @dchen1107

@ehashman (Member)

/assign @mrunalp @ehashman

@odinuge (Member) left a comment

Thanks for working on this @kolyshkin!

Looks like all my initial comments from the other PR are resolved here, so it looks good from my pov. I am a big fan of all the performance improvements here, as well as the cleanups.

Code looks good, but I would prefer that someone like @ehashman also take a quick look to validate this. Thanks

/lgtm

To be sure:
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager
/test pull-kubernetes-node-memoryqos-cgrpv2
/test pull-kubernetes-node-swap-fedora
/test pull-kubernetes-node-swap-fedora-serial
/test pull-kubernetes-node-swap-ubuntu-serial

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2022
@ehashman (Member) commented Mar 21, 2022

Looking at tests:

  • pull-kubernetes-node-kubelet-serial-cpu-manager
    • failed to build
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • pull-kubernetes-node-kubelet-serial-hugepages
    • failed to build
  • pull-kubernetes-node-kubelet-serial-memory-manager
    • failed to build
  • pull-kubernetes-node-kubelet-serial-topology-manager
    • failed to build
  • pull-kubernetes-node-memoryqos-cgrpv2
    • "The resource 'projects/k8s-infra-e2e-boskos-011/regions/us-west1/addresses/bootstrap-e2e-master-ip' was not found"
  • pull-kubernetes-node-swap-fedora
  • pull-kubernetes-node-swap-fedora-serial
  • pull-kubernetes-node-swap-ubuntu-serial
    • the only job with an actual test failure:
      E2eNode Suite: [sig-node] Container Manager Misc [Serial] Validate OOM score adjustments [NodeFeature:OOMScoreAdj] once the node is setup container runtime's oom-score-adj should be -999

      test/e2e_node/container_manager_test.go:81
      Timed out after 300.001s.
      Expected
          <*errors.errorString | 0xc000b28000>: {
              s: "expected pid 777's oom_score_adj to be -999; found -500",
          }
      to be nil
      test/e2e_node/container_manager_test.go:87

I know a lot of the presubmits aren't in great shape as they weren't a priority after the dockershim migration since they're only used occasionally. Trying to launch this many node tests simultaneously might have hit capacity issues. Let's see if we can focus on just a few critical ones for this PR.

/skip

@ehashman (Member)

Given that https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108597/pull-kubernetes-node-kubelet-serial-containerd/1505844761926307840/ passed I'm not super concerned by most of these failures, I think the jobs are just a bit neglected and we tried to run jobs we wouldn't normally.

I think the Ubuntu bug is a true failure. According to triage, we've only seen it on the runc 1.1 bump PRs. Is it deterministic?

/test pull-kubernetes-node-swap-ubuntu-serial

Failure cluster 5c6173278f01b3767cdf

Error text:
test/e2e_node/container_manager_test.go:81
Timed out after 300.004s.
Expected
    <*errors.errorString | 0xc0017ac050>: {
        s: "expected pid 777's oom_score_adj to be -999; found -500",
    }
to be nil
test/e2e_node/container_manager_test.go:87

Recent failures:

3/21/2022, 2:52:29 AM pr:pull-kubernetes-node-swap-ubuntu-serial
3/7/2022, 5:32:16 PM pr:pull-kubernetes-node-swap-ubuntu-serial

@bobbypage (Member) commented Mar 22, 2022

I'm a little confused on the pull-kubernetes-node-swap-ubuntu-serial failure. The test that is failing there is E2eNode Suite: [sig-node] Container Manager Misc [Serial] Validate OOM score adjustments [NodeFeature:OOMScoreAdj] once the node is setup container runtime's oom-score-adj should be -999. However, the same test is passing in the pull-kubernetes-node-kubelet-serial-containerd job (see https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108597/pull-kubernetes-node-kubelet-serial-containerd/1505844761926307840/).

Maybe this is something specific to the setup of the ubuntu swap job? It's weird to see the test fail only on the swap job; I would expect it to either fail on both or neither :)
/cc @ehashman

@k8s-ci-robot k8s-ci-robot requested a review from ehashman March 22, 2022 19:48
@bobbypage (Member)

The swap job is failing because it does not set container-runtime-process-name, so the swap test incorrectly thinks the runtime is dockerd instead of containerd (dockerd has a -500 OOM score, while containerd has -999, which is what the test expects).

The serial job, in contrast, correctly sets container-runtime-process-name, so it doesn't have this problem. kubernetes/test-infra#25726 will fix the swap job.
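For context, the e2e test resolves the runtime's pid by process name and then checks /proc/<pid>/oom_score_adj. A minimal sketch of the parsing step of that check (the function name is mine, not the test's):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOOMScoreAdj parses the contents of /proc/<pid>/oom_score_adj,
// which is effectively what the e2e test reads once it has resolved the
// runtime's pid. If the job resolves the wrong process name (dockerd
// instead of containerd), it ends up seeing -500 instead of the
// expected -999. Illustrative helper, not the actual test code.
func parseOOMScoreAdj(data string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(data))
}

func main() {
	got, _ := parseOOMScoreAdj("-500\n")  // what the misconfigured job saw
	want, _ := parseOOMScoreAdj("-999\n") // what the test expects
	fmt.Println(got, want)
}
```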

@bobbypage (Member)

/test pull-kubernetes-node-swap-ubuntu-serial

@bobbypage (Member) commented Mar 22, 2022

/lgtm

LGTM on the code changes, thanks for all the refactoring, it's much cleaner now with less duplication of cgroup configuration.

The swap job is now passing as well after re-running with kubernetes/test-infra#25726 merged

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolyshkin, mrunalp, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2022
@ehashman (Member) left a comment

/lgtm

thanks for fixing the job!

@kolyshkin (Contributor, Author)

Some weird errors; they look like they're due to Go 1.18 being used by default. Let me try to rebase this.

@kolyshkin (Contributor, Author)

I also think that maybe golangci-lint needs to be updated to 1.45.0 because of Go 1.18.

Ah, indeed, there's a PR opened already: #108901 to disable the jobs. I think I might know how to fix the issue, so here's #108902

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

1 similar comment posted by @k8s-triage-robot.

@k8s-ci-robot (Contributor) commented Mar 23, 2022

@kolyshkin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

All of the following failed on commit de5a69d; none are required for merge. Each can be rerun with /test <job name>:

  • pull-kubernetes-node-kubelet-serial-memory-manager
  • pull-kubernetes-node-kubelet-serial-cpu-manager
  • pull-kubernetes-node-kubelet-serial-topology-manager
  • pull-kubernetes-node-kubelet-serial-hugepages
  • pull-kubernetes-node-memoryqos-cgrpv2
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • pull-kubernetes-node-swap-fedora
  • pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • pull-kubernetes-node-swap-fedora-serial

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot posted the same retest notice again.
