
Add kubelet serial test suite for cgroup v1 and cgroup v2 #22290

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from harche:systemd-serial
May 24, 2021

Conversation

@harche (Contributor) commented May 21, 2021

This PR adds kubelet node serial jobs that run on nodes configured with crio + systemd + cgroup v1 and crio + systemd + cgroup v2.

Signed-off-by: Harshal Patil [email protected]

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 21, 2021
@k8s-ci-robot k8s-ci-robot requested review from karan and sjenning May 21, 2021 06:39
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

/sig node

@harche (Contributor, Author) commented May 21, 2021

/cc @odinuge

@harche (Contributor, Author) commented May 21, 2021

I am running them manually in my local k8s clone using make test-e2e-node. Once we see them passing there, I will remove the hold on this PR.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

@odinuge I bumped up the timeout to 420m. But with such a big timeout, it doesn't make sense to me to keep this as a presubmit. Maybe we can keep it for now, because it's easier to debug and we don't want to keep spinning up a periodic job that we know is going to fail for sure. But once we fix all the tests, I think we should convert this job into a periodic that runs every 4 hours or so. WDYT?

Also, in my local tests the job was successfully launched. So I am going to remove the hold. We will fix the tests one by one.

/hold cancel
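As a rough illustration of the periodic idea discussed above, a Prow periodic job running every 4 hours would look something like the sketch below. This is a hypothetical fragment, not the config added by this PR; the job name, cluster, and image are placeholders.

```yaml
# Hypothetical sketch of a Prow periodic job definition (interval-based).
# Job name, cluster, image, and command are placeholders.
periodics:
- name: ci-crio-cgroupv2-node-e2e-serial
  interval: 4h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master
      command:
      - runner.sh
      # args would invoke the serial node e2e suite, as in the presubmit
```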

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2021
@harche harche changed the title WIP: Add kubelet serial test suite for cgroup v1 and cgroup v2 Add kubelet serial test suite for cgroup v1 and cgroup v2 May 21, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

For reference, the command used to invoke the tests locally:

 export KUBE_SSH_USER=core && make test-e2e-node REMOTE=true RUNTIME=remote IMAGE_CONFIG_FILE="/home/harshal/Downloads/image-config-cgrpv1.yaml" INSTANCE_PREFIX="harpatil" CLEANUP=false CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/crio/crio.sock" FOCUS="\[Serial\]" SKIP="\[Flaky\]|\[Benchmark\]|\[NodeSpecialFeature:.+\]|\[NodeAlphaFeature:.+\]" PARALLELISM="1" TEST_ARGS='--kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service --non-masquerade-cidr=0.0.0.0/0 --feature-gates=DynamicKubeletConfig=true,LocalStorageCapacityIsolation=true" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'

and a small hacky change,

$ git diff 
diff --git a/hack/e2e-node-test.sh b/hack/e2e-node-test.sh
index 94c2c7253ee..1047a60cb90 100755
--- a/hack/e2e-node-test.sh
+++ b/hack/e2e-node-test.sh
@@ -49,4 +49,4 @@ echo "The equivalent of this invocation is: "
 echo "    make test-e2e-node ${ARGHELP}"
 echo
 echo
-make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}"
+make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}" PARALLELISM="${PARALLELISM:-}"
diff --git a/test/e2e/framework/test_context.go b/test/e2e/framework/test_context.go
index cce5ad97fec..12d5083e595 100644
--- a/test/e2e/framework/test_context.go
+++ b/test/e2e/framework/test_context.go
@@ -298,7 +298,7 @@ func RegisterCommonFlags(flags *flag.FlagSet) {
        flags.StringVar(&TestContext.ReportPrefix, "report-prefix", "", "Optional prefix for JUnit XML reports. Default is empty, which doesn't prepend anything to the default name.")
        flags.StringVar(&TestContext.ReportDir, "report-dir", "", "Path to the directory where the JUnit XML reports should be saved. Default is empty, which doesn't generate these reports.")
        flags.Var(cliflag.NewMapStringBool(&TestContext.FeatureGates), "feature-gates", "A set of key=value pairs that describe feature gates for alpha/experimental features.")
-       flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "docker", "The container runtime of cluster VM instances (docker/remote).")
+       flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "remote", "The container runtime of cluster VM instances (docker/remote).")
        flags.StringVar(&TestContext.ContainerRuntimeEndpoint, "container-runtime-endpoint", "unix:///var/run/dockershim.sock", "The container runtime endpoint of cluster VM instances.")
        flags.StringVar(&TestContext.ContainerRuntimeProcessName, "container-runtime-process-name", "dockerd", "The name of the container runtime process.")
        flags.StringVar(&TestContext.ContainerRuntimePidFile, "container-runtime-pid-file", "/var/run/docker.pid", "The pid file of the container runtime.")
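Since these jobs target both cgroup v1 and cgroup v2 nodes, a quick way to tell which mode a node is running is to look at the filesystem type mounted at /sys/fs/cgroup. A minimal sketch, assuming a standard Linux mount layout (the helper name is mine, not part of the e2e framework):

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup mode.
# cgroup2fs => unified hierarchy (cgroup v2); tmpfs => legacy hierarchy (cgroup v1).
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# On a live node:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
cgroup_mode cgroup2fs   # prints "v2"
```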

@odinuge (Member) commented May 21, 2021

> @odinuge I bumped up the timeout to 420m. But with such a big timeout, it doesn't make sense to me to keep this as a presubmit. Maybe we can keep it for now, because it's easier to debug and we don't want to keep spinning up a periodic job that we know is going to fail for sure. But once we fix all the tests, I think we should convert this job into a periodic that runs every 4 hours or so. WDYT?

👍

That sounds like a good plan! Yeah, having these as presubmits is kinda overkill, and 420m is a looong time. When it starts passing without flakes, we can look at the tests that take the most time and decide what to do with them.

> Also, in my local tests the job was successfully launched. So I am going to remove the hold. We will fix the tests one by one.

Awesome!

@odinuge (Member) left a review

/lgtm

Awesome, thanks for working on this! 😄

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@harche (Contributor, Author) commented May 21, 2021

@odinuge Just finished one local run; unfortunately, I forgot to set the timeout in that run, which resulted in the default timeout of 240m. So the job got killed while it was still running.

JFYI, while many tests were passing I saw one test case failure:

• Failure [300.058 seconds]
[sig-node] Container Manager Misc [Serial]
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/framework.go:23
  Validate OOM score adjustments [NodeFeature:OOMScoreAdj]
  _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:78
    once the node is setup
    _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:79
      container runtime's oom-score-adj should be -999 [It]
      _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:80

      Timed out after 300.001s.
      Expected
          <*errors.errorString | 0xc001bf8c10>: {
              s: "expected pid 2961's oom_score_adj to be -999; found 0",
          }
      to be nil

      _output/local/go/src/k8s.io/kubernetes/test/e2e_node/container_manager_test.go:86

But since it got killed after 240m we don't know what would have happened with other remaining test cases. This was on cgroup v1 btw.
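For context, that test checks the container runtime process's oom_score_adj in procfs. A minimal sketch of the same check (helper names are mine, not the e2e framework's; assumes Linux /proc):

```shell
#!/bin/sh
# Read the OOM score adjustment for a given pid from procfs.
oom_score_adj() {
  cat "/proc/$1/oom_score_adj"
}

# Roughly what the failing assertion verifies: the container runtime's
# pid should report -999 so the kernel avoids OOM-killing it.
check_runtime_oom_score_adj() {
  want="-999"
  got="$(oom_score_adj "$1")"
  if [ "$got" = "$want" ]; then
    echo "ok"
  else
    echo "expected pid $1's oom_score_adj to be $want; found $got"
  fi
}
```

In the failure above, the crio process (pid 2961) reported 0 instead of -999, which is what the test timed out waiting on.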

@odinuge (Member) commented May 21, 2021

> @odinuge Just finished one local run; unfortunately I forgot to set the timeout in that run, which resulted in the default timeout of 240m. So the job got killed while it was still running.

Ahh, I see. You can use TIMEOUT="6h" to mitigate that. 😅

> JFYI, while many tests were passing I saw one test case failure:

Thanks!

Ooooh. Interesting. I have seen that a lot of OOM logic has been changed lately, so maybe something has broken it (or it has been broken the whole time):

https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+oom+is%3Aclosed

Looking forward to getting these test suites up and running properly again!

@odinuge (Member) commented May 21, 2021

In #21828 we changed to a bigger instance type for the serial tests. We should probably do that here as well, or is it already bigger than the default one (I am not that familiar with this)?

We can wait and see if there is a problem before we switch tho.

@harche (Contributor, Author) commented May 21, 2021

> In #21828 we changed to a bigger instance type for the serial tests. We should probably do that here as well, or is it already bigger than the default one (I am not that familiar with this)?
>
> We can wait and see if there is a problem before we switch tho.

Oh no, I was running those tests with the n1-standard-1 instance type. I will kill those now and start again with n1-standard-2. Thanks for letting me know.

These jobs will run using crio and systemd

Signed-off-by: Harshal Patil <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@harche harche requested review from giuseppe and odinuge May 21, 2021 13:34
@harche (Contributor, Author) commented May 22, 2021

After a 6-hour run using n1-standard-2 on cgroup v1 with systemd:

Summarizing 4 Failures:

[Fail] [sig-node] CriticalPod [Serial] [Disruptive] [NodeFeature:CriticalPod] when we need to admit a critical pod [It] should be able to create and delete a critical pod 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[Fail] [sig-node] NodeProblemDetector [NodeFeature:NodeProblemDetector] [Serial] SystemLogMonitor [It] should generate node condition and events for corresponding errors 
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_problem_detector_linux.go:407

[Fail] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [It] Verifies the Kubelet device plugin functionality. 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[Fail] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure  [BeforeEach] should eventually evict all of the correct pods 
/home/harshal/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

Ran 24 of 196 Specs in 21612.181 seconds
FAIL! -- 19 Passed | 5 Failed | 0 Pending | 172 Skipped

Ginkgo ran 1 suite in 6h0m12.366034838s
Test Suite Failed

@harche (Contributor, Author) commented May 24, 2021

@odinuge looking at #22290 (comment), I think we should set the timeout to at least 6 hours; setting it to any lower value will result in a timeout.

@odinuge (Member) commented May 24, 2021

> @odinuge looking at #22290 (comment), I think we should set the timeout to at least 6 hours; setting it to any lower value will result in a timeout.

That sounds like a good start! 😄
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@harche (Contributor, Author) commented May 24, 2021

/cc @dims @derekwaynecarr @sjenning

@dims (Member) commented May 24, 2021

/approve
/lgtm

🤞🏾

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, giuseppe, harche, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit 60c03ad into kubernetes:master May 24, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 24, 2021
@k8s-ci-robot (Contributor)

@harche: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

> This PR adds kubelet node serial jobs that run on nodes configured with crio + systemd + cgroup v1 and crio + systemd + cgroup v2.
>
> Signed-off-by: Harshal Patil [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
