Add kubelet serial test suite for cgroup v1 and cgroup v2 #22290
k8s-ci-robot merged 1 commit into kubernetes:master from harche:systemd-serial
Conversation
/sig node

/cc @odinuge
I am running them manually in my local k8s clone.

/hold
@odinuge I bumped the timeout up to 420m. But with such a big timeout, it doesn't make sense to me to keep this as a presubmit. Maybe we can keep it for now, because it's easier to debug and we don't want to keep spinning a periodic job that we know is going to fail for sure. But once we fix all the tests, I think we should convert this job into a periodic that runs every 4 hours or so. WDYT?

Also, in my local tests the job launched successfully, so I am going to remove the hold. We will fix the tests one by one.

/hold cancel
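A minimal sketch of what such a Prow periodic could look like, assuming the job uses pod-utility decoration; the job name and image tag below are hypothetical, not values from this PR:

```yaml
periodics:
- name: ci-crio-cgroupv1-node-e2e-serial  # hypothetical job name
  interval: 4h                            # rerun every 4 hours
  decorate: true
  decoration_config:
    timeout: 420m                         # match the bumped presubmit timeout
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master  # hypothetical tag
      command:
      - runner.sh
```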
For ref, the command used for invoking locally:

```shell
export KUBE_SSH_USER=core && make test-e2e-node REMOTE=true RUNTIME=remote \
  IMAGE_CONFIG_FILE="/home/harshal/Downloads/image-config-cgrpv1.yaml" \
  INSTANCE_PREFIX="harpatil" CLEANUP=false \
  CONTAINER_RUNTIME_ENDPOINT="unix:///var/run/crio/crio.sock" \
  FOCUS="\[Serial\]" \
  SKIP="\[Flaky\]|\[Benchmark\]|\[NodeSpecialFeature:.+\]|\[NodeAlphaFeature:.+\]" \
  PARALLELISM="1" \
  TEST_ARGS='--kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service --non-masquerade-cidr=0.0.0.0/0 --feature-gates=DynamicKubeletConfig=true,LocalStorageCapacityIsolation=true" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"'
```

and a small hacky change:

```diff
$ git diff
diff --git a/hack/e2e-node-test.sh b/hack/e2e-node-test.sh
index 94c2c7253ee..1047a60cb90 100755
--- a/hack/e2e-node-test.sh
+++ b/hack/e2e-node-test.sh
@@ -49,4 +49,4 @@ echo "The equivalent of this invocation is: "
echo " make test-e2e-node ${ARGHELP}"
echo
echo
-make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}"
+make --no-print-directory -C "${KUBE_ROOT}" test-e2e-node FOCUS="${FOCUS:-}" SKIP="${SKIP:-}" PARALLELISM="${PARALLELISM:-}"
diff --git a/test/e2e/framework/test_context.go b/test/e2e/framework/test_context.go
index cce5ad97fec..12d5083e595 100644
--- a/test/e2e/framework/test_context.go
+++ b/test/e2e/framework/test_context.go
@@ -298,7 +298,7 @@ func RegisterCommonFlags(flags *flag.FlagSet) {
 	flags.StringVar(&TestContext.ReportPrefix, "report-prefix", "", "Optional prefix for JUnit XML reports. Default is empty, which doesn't prepend anything to the default name.")
 	flags.StringVar(&TestContext.ReportDir, "report-dir", "", "Path to the directory where the JUnit XML reports should be saved. Default is empty, which doesn't generate these reports.")
 	flags.Var(cliflag.NewMapStringBool(&TestContext.FeatureGates), "feature-gates", "A set of key=value pairs that describe feature gates for alpha/experimental features.")
-	flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "docker", "The container runtime of cluster VM instances (docker/remote).")
+	flags.StringVar(&TestContext.ContainerRuntime, "container-runtime", "remote", "The container runtime of cluster VM instances (docker/remote).")
 	flags.StringVar(&TestContext.ContainerRuntimeEndpoint, "container-runtime-endpoint", "unix:///var/run/dockershim.sock", "The container runtime endpoint of cluster VM instances.")
 	flags.StringVar(&TestContext.ContainerRuntimeProcessName, "container-runtime-process-name", "dockerd", "The name of the container runtime process.")
 	flags.StringVar(&TestContext.ContainerRuntimePidFile, "container-runtime-pid-file", "/var/run/docker.pid", "The pid file of the container runtime.")
```
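With that change applied, the wrapper script can drive the serial suite with an explicit parallelism via environment variables. A hypothetical invocation, with the FOCUS/SKIP values taken from the command above:

```shell
# Run the serial node e2e tests one at a time through the patched wrapper.
FOCUS='\[Serial\]' \
SKIP='\[Flaky\]|\[Benchmark\]' \
PARALLELISM=1 \
hack/e2e-node-test.sh
```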
👍 That sounds like a good plan! Yeah, running these as a presubmit is kinda overkill, and 420m is a looong time. When it starts passing without flakes, we can look at the tests that take the most time and figure out what to do with them.

Awesome!
odinuge left a comment:

/lgtm

Awesome, thanks for working on this! 😄
@odinuge Just finished one local run. Unfortunately I forgot to set the timeout, which resulted in the default timeout of 240m, so the job got killed while it was still running. JFYI, while many tests were passing, I saw one test case failure. But since the run got killed after 240m, we don't know what would have happened with the remaining test cases. This was on cgroup v1, btw.
Ahh, I see. You can use the
Thanks! Ooooh, interesting. I have seen that a lot of the OOM logic has been changed lately, so maybe something has broken it (or it has been broken the whole time). Looking forward to getting these test suites up and running properly again!
In #21828 we changed to a bigger instance type for the serial tests. We should probably do that here as well, or is it already bigger than the default one (I am not that familiar with this)? We can wait and see if there is a problem before we switch, though.
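For context, the instance type for node e2e runs comes from the image config passed via IMAGE_CONFIG_FILE. A sketch of such a config with illustrative values only; the image name, project, and machine type below are assumptions, not what #21828 actually used:

```yaml
# Illustrative image-config-cgrpv1.yaml; all names and values are assumptions.
images:
  fedora-coreos:
    image: fedora-coreos-stable      # hypothetical GCE image name
    project: fedora-coreos-cloud     # hypothetical GCE image project
    machine: n1-standard-2           # bump this field for a bigger instance type
```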
Oh, no I was running those tests with |
These jobs will run using crio and systemd.

Signed-off-by: Harshal Patil <[email protected]>
After a 6-hour run using
@odinuge Looking at #22290 (comment), I think we should set the timeout to at least 6 hours; setting it to any value lower than that will result in a timeout.
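Assuming the job uses pod-utility decoration, that would be a small change in its decoration config; a sketch (the grace_period value is an assumption):

```yaml
decorate: true
decoration_config:
  timeout: 6h        # long enough for the full serial suite
  grace_period: 15m  # assumed cleanup budget after the timeout fires
```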
/approve 🤞🏾
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, giuseppe, harche, odinuge

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@harche: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This test suite adds kubelet node serial jobs that run on nodes with crio + systemd + cgroup v1 and crio + systemd + cgroup v2 configurations.

Signed-off-by: Harshal Patil [email protected]