
Add kubelet serial test suite for cgroup v1 and cgroup v2 #22290

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from harche:systemd-serial
May 24, 2021

Conversation

@harche (Contributor) commented May 21, 2021

This PR adds kubelet node serial jobs that run on nodes configured with crio + systemd + cgroup v1 and crio + systemd + cgroup v2.

Signed-off-by: Harshal Patil [email protected]

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 21, 2021
@k8s-ci-robot k8s-ci-robot requested review from karan and sjenning May 21, 2021 06:39
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

/sig node

@harche (Contributor, Author) commented May 21, 2021

/cc @odinuge

@harche (Contributor, Author) commented May 21, 2021

I am running them manually in my local k8s clone using make test-e2e-node. Once we see them passing there, I will remove the hold on this PR.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

@odinuge I bumped up the timeout to 420m. But with such a big timeout, it doesn't make sense to me to keep this as a presubmit. Maybe we can keep it for now, because it's easier to debug and we don't want to keep spinning up a periodic job that we know is going to fail for sure. But once we fix all the tests, I think we should convert this job into a periodic that runs every 4 hours or so. WDYT?

Also, in my local tests the job was successfully launched. So I am going to remove the hold. We will fix the tests one by one.

/hold cancel
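As a rough illustration of the periodic idea discussed above, a Prow periodic job running every 4 hours would look something like the sketch below. This is a hypothetical fragment, not the config added by this PR; the job name, cluster, and image are placeholders.

```yaml
# Hypothetical sketch of a Prow periodic job definition (interval-based).
# Job name, cluster, image, and command are placeholders.
periodics:
- name: ci-crio-cgroupv2-node-e2e-serial
  interval: 4h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master
      command:
      - runner.sh
      # args would invoke the serial node e2e suite, as in the presubmit
```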

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2021
@harche harche changed the title WIP: Add kubelet serial test suite for cgroup v1 and cgroup v2 Add kubelet serial test suite for cgroup v1 and cgroup v2 May 21, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

For reference, the command used to invoke the tests locally:

 export KUBE_SSH_USER=core && make test-e2e-node REMOTE=true RUNTIME=remote IMAGE_CONFIG_FILE="/home/harshal/Downloads/image-config-cgrpv1.yaml" INSTANCE_PREFIX="harpatil" CLEANUP=false CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/crio/crio.sock" FOCUS="\[Serial\]" SKIP="\[Flaky\]|\[Benchmark\]|\[NodeSpecialFeature:.+\]|\[NodeAlphaFeature:.+\]" PARALLELISM="1" TEST_ARGS='--kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service --non-masquerade-cidr=0.0.0.0/0 --feature-gates=DynamicKubeletConfig=true,LocalStorageCapacityIsolation=true" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'

and a small hacky change,

$ git diff 
diff --git a/hack/e2e-node-test.sh b/hack/e2e-node-test.sh
index 94c2c7253ee..1047a60cb90 100755
--- a/hack/e2e-node-test.sh
+++ b/hack/e2e-node-test.sh
@@ -49,4 +49,4 @@ echo "The equivalent of this invocation is: "
 echo "    make test-e2e-node ${ARGHELP}"
 echo
 echo
-make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}"
+make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}" PARALLELISM="${PARALLELISM:-}"
diff --git a/test/e2e/framework/test_context.go b/test/e2e/framework/test_context.go
index cce5ad97fec..12d5083e595 100644
--- a/test/e2e/framework/test_context.go
+++ b/test/e2e/framework/test_context.go
@@ -298,7 +298,7 @@ func RegisterCommonFlags(flags *flag.FlagSet) {
        flags.StringVar(&TestContext.ReportPrefix, "report-prefix", "", "Optional prefix for JUnit XML reports. Default is empty, which doesn't prepend anything to the default name.")
        flags.StringVar(&TestContext.ReportDir, "report-dir", "", "Path to the directory where the JUnit XML reports should be saved. Default is empty, which doesn't generate these reports.")
        flags.Var(cliflag.NewMapStringBool(&TestContext.FeatureGates), "feature-gates", "A set of key=value pairs that describe feature gates for alpha/experimental features.")
-       flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "docker", "The container runtime of cluster VM instances (docker/remote).")
+       flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "remote", "The container runtime of cluster VM instances (docker/remote).")
        flags.StringVar(&TestContext.ContainerRuntimeEndpoint, "container-runtime-endpoint", "unix:///var/run/dockershim.sock", "The container runtime endpoint of cluster VM instances.")
        flags.StringVar(&TestContext.ContainerRuntimeProcessName, "container-runtime-process-name", "dockerd", "The name of the container runtime process.")
        flags.StringVar(&TestContext.ContainerRuntimePidFile, "container-runtime-pid-file", "/var/run/docker.pid", "The pid file of the container runtime.")
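Since these jobs target both cgroup v1 and cgroup v2 nodes, a quick way to tell which mode a node is running is to look at the filesystem type mounted at /sys/fs/cgroup. A minimal sketch, assuming a standard Linux mount layout (the helper name is mine, not part of the e2e framework):

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup mode.
# cgroup2fs => unified hierarchy (cgroup v2); tmpfs => legacy hierarchy (cgroup v1).
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# On a live node:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
cgroup_mode cgroup2fs   # prints "v2"
```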

@odinuge (Member) commented May 21, 2021

> @odinuge I bumped up the timeout to 420m. But with such a big timeout, it doesn't make sense to me to keep this as a presubmit. Maybe we can keep it for now, because it's easier to debug and we don't want to keep spinning up a periodic job that we know is going to fail for sure. But once we fix all the tests, I think we should convert this job into a periodic that runs every 4 hours or so. WDYT?

👍

That sounds like a good plan! Yeah, having these as presubmits is kinda overkill, and 420m is a looong time. When it starts passing without flakes, we can look at the tests that take the most time and decide what to do with them.

> Also, in my local tests the job was successfully launched. So I am going to remove the hold. We will fix the tests one by one.

Awesome!

@odinuge (Member) left a review

/lgtm

Awesome, thanks for working on this! 😄

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

@odinuge Just finished one local run; unfortunately, I forgot to set the timeout in that run, which resulted in the default timeout of 240m. So the job got killed while it was still running.

JFYI, while many tests were passing I saw one test case failure:

• Failure [300.058 seconds]
[sig-node] Container Manager Misc [Serial]
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/framework.go:23
  Validate OOM score adjustments [NodeFeature:OOMScoreAdj]
  _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:78
    once the node is setup
    _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:79
      container runtime's oom-score-adj should be -999 [It]
      _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:80

      Timed out after 300.001s.
      Expected
          <*errors.errorString | 0xc001bf8c10>: {
              s: "expected pid 2961's oom_score_adj to be -999; found 0",
          }
      to be nil

      _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:86

But since it got killed after 240m we don't know what would have happened with other remaining test cases. This was on cgroup v1 btw.
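For context, that test checks the container runtime process's oom_score_adj in procfs. A minimal sketch of the same check (helper names are mine, not the e2e framework's; assumes Linux /proc):

```shell
#!/bin/sh
# Read the OOM score adjustment for a given pid from procfs.
oom_score_adj() {
  cat "/proc/$1/oom_score_adj"
}

# Roughly what the failing assertion verifies: the container runtime's
# pid should report -999 so the kernel avoids OOM-killing it.
check_runtime_oom_score_adj() {
  want="-999"
  got="$(oom_score_adj "$1")"
  if [ "$got" = "$want" ]; then
    echo "ok"
  else
    echo "expected pid $1's oom_score_adj to be $want; found $got"
  fi
}
```

In the failure above, the crio process (pid 2961) reported 0 instead of -999, which is what the test timed out waiting on.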

@odinuge (Member) commented May 21, 2021

> @odinuge Just finished one local run; unfortunately I forgot to set the timeout in that run, which resulted in the default timeout of 240m. So the job got killed while it was still running.

Ahh, I see. You can use TIMEOUT="6h" to mitigate that. 😅

> JFYI, while many tests were passing I saw one test case failure:

Thanks!

Ooooh. Interesting. I have seen that a lot of OOM logic has been changed lately, so maybe something has broken it (or it has been broken the whole time):

https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+oom+is%3Aclosed

Looking forward to getting these test suites up and running properly again!

@odinuge (Member) commented May 21, 2021

In #21828 we changed to a bigger instance type for the serial tests. We should probably do that here as well, or is it already bigger than the default one (I am not that familiar with this)?

We can wait and see if there is a problem before we switch tho.

@harche (Contributor, Author) commented May 21, 2021

> In #21828 we changed to a bigger instance type for the serial tests. We should probably do that here as well, or is it already bigger than the default one (I am not that familiar with this)?
>
> We can wait and see if there is a problem before we switch tho.

Oh no, I was running those tests with the n1-standard-1 instance type. I will kill those now and start again with n1-standard-2. Thanks for letting me know.

These jobs will run using crio and systemd

Signed-off-by: Harshal Patil <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@harche harche requested review from giuseppe and odinuge May 21, 2021 13:34
@harche (Contributor, Author) commented May 22, 2021

After a 6-hour run using n1-standard-2 on cgroup v1 with systemd:

Summarizing 4 Failures:

[Fail] [sig-node] CriticalPod [Serial] [Disruptive] [NodeFeature:CriticalPod] when we need to admit a critical pod [It] should be able to create and delete a critical pod 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[Fail] [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] [Serial] SystemLogMonitor [It] should generate node condition and events for corresponding errors 
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_problem_detector_linux.go:407

[Fail] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [It] Verifies the Kubelet device plugin functionality. 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[Fail] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure  [BeforeEach] should eventually evict all of the correct pods 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

Ran 24 of 196 Specs in 21612.181 seconds
FAIL! -- 19 Passed | 5 Failed | 0 Pending | 172 Skipped

Ginkgo ran 1 suite in 6h0m12.366034838s
Test Suite Failed

@harche (Contributor, Author) commented May 24, 2021

@odinuge looking at #22290 (comment), I think we should set the timeout to at least 6 hours; setting it to any lower value will result in a timeout.

@odinuge (Member) commented May 24, 2021

> @odinuge looking at #22290 (comment), I think we should set the timeout to at least 6 hours; setting it to any lower value will result in a timeout.

That sounds like a good start! 😄
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@harche (Contributor, Author) commented May 24, 2021

/cc @dims @derekwaynecarr @sjenning

@dims (Member) commented May 24, 2021

/approve
/lgtm

🤞🏾

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, giuseppe, harche, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit 60c03ad into kubernetes:master May 24, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 24, 2021
@k8s-ci-robot (Contributor)

@harche: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

> This PR adds kubelet node serial jobs that run on nodes configured with crio + systemd + cgroup v1 and crio + systemd + cgroup v2.
>
> Signed-off-by: Harshal Patil [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
