
Presubmit cgroup v1 crio NodeConformance and NodeFeatures #21278

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from harche:always_run on Mar 17, 2021

Conversation

@harche
Contributor

@harche harche commented Mar 10, 2021

This PR will enable the pull-kubernetes-node-crio-e2e presubmit job to run against every k8s PR on the master branch, but it will be an optional job.

pull-kubernetes-node-crio-e2e runs NodeConformance and NodeFeatures node e2e tests using cgroup v1.

Signed-off-by: Harshal Patil [email protected]
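For reference, a Prow presubmit entry of the shape described above looks roughly like the following. This is an illustrative sketch, not the exact merged configuration; values such as `decorate` and the annotation placement are assumptions (the testgrid annotations themselves appear later in this review).

```yaml
# Illustrative sketch of a Prow presubmit job entry; values are
# assumptions and do not reproduce the exact merged config.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-node-crio-e2e
    always_run: true   # triggered on every PR push
    optional: true     # job failure does not block merging
    branches:
    - master
    decorate: true
    annotations:
      testgrid-dashboards: sig-node-cri-o
      testgrid-tab-name: pr-crio-gce-e2e
```

`optional: true` keeps the job from gating merges, while `skip_report: true` (added during this review) additionally hides its status from the GitHub UI.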

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 10, 2021
@k8s-ci-robot k8s-ci-robot requested review from MHBauer and bart0sh March 10, 2021 07:12
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 10, 2021
@harche
Contributor Author

harche commented Mar 10, 2021

/cc @dims @sjenning

@k8s-ci-robot k8s-ci-robot requested review from dims and sjenning March 10, 2021 07:13
@harche
Contributor Author

harche commented Mar 10, 2021

/hold

We seem to be hitting a small issue in CRI-O CI while downloading the static binaries.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2021
@harche
Contributor Author

harche commented Mar 10, 2021

I will remove the hold after cri-o/cri-o#4642 is merged.

Member

@dims dims left a comment


please add skip_report: true so it does not show up in the github UI. let's run it for a few weeks like that and then remove it.

also, is there an equivalent CI job already? if not, let's please add one.

@harche
Contributor Author

harche commented Mar 10, 2021

please add skip_report: true so it does not show up in the github UI. let's run it for a few weeks like that and then remove it.

also, is there an equivalent CI job already? if not, let's please add one.

Thanks @dims. I added skip_report: true.

We do have equivalent periodic jobs, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*

  • ci-crio-cgroupv1-node-e2e-features runs NodeFeatures e2e tests every hour using k8s master
  • ci-crio-cgroupv1-node-e2e-conformance runs NodeConformance e2e tests every hour using k8s master
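For context, an hourly periodic job of that shape is declared roughly like this in Prow. This is a sketch; only the job name and the hourly cadence come from the discussion above, the remaining fields are assumptions.

```yaml
# Illustrative sketch of an hourly periodic Prow job; only the name and
# cadence come from the discussion above, the rest is assumed.
periodics:
- name: ci-crio-cgroupv1-node-e2e-conformance
  interval: 1h          # "every hour using k8s master"
  decorate: true
  extra_refs:           # periodics check out the repo explicitly
  - org: kubernetes
    repo: kubernetes
    base_ref: master
```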

Unfortunately those tests have been failing for the last few hours due to an issue in CRI-O CI, but the fix is on the way: cri-o/cri-o#4642. In fact, I have put a /hold on this PR for the same reason.

@dims
Member

dims commented Mar 10, 2021

/approve

will lgtm once the ci jobs are green

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

I will remove the hold after cri-o/cri-o#4642 is merged.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

/approve

will lgtm once the ci jobs are green

@dims the CI jobs are back to green again, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*

@harche
Contributor Author

harche commented Mar 11, 2021

I didn't quite get why pull-test-infra-bazel is failing, any clue? @SergeyKanzhelev @dims

@dims
Member

dims commented Mar 11, 2021

/retest
/lgtm

🤞🏾 that this is a flake!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

/retest

@BenTheElder
Member

FAIL: TestPresubmits (0.03s)
jobs_test.go:93: Job pull-kubernetes-node-crio-e2e: Cannot have both branches and skip_branches set

  testgrid-dashboards: sig-node-cri-o
  testgrid-tab-name: pr-crio-gce-e2e
- always_run: false
+ always_run: true
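The TestPresubmits failure above is Prow's config validation: a job's branch filter may use `branches` or `skip_branches`, but not both at once. Roughly, using hypothetical fragments for illustration (the skip_branches value below is invented):

```yaml
# Rejected by Prow config validation: both branch filters set on one job.
# - name: pull-kubernetes-node-crio-e2e
#   branches: ["master"]
#   skip_branches: ["release-1.20"]   # hypothetical value

# Accepted: exactly one branch filter.
- name: pull-kubernetes-node-crio-e2e
  branches:
  - master
```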
Member


@dims do we really need another always_run? We're trying to reduce the number of presubmits we rely on because:

  • they add noise to contributors when flaking
  • presubmits cannot consume a shared build, unlike periodics after merge; we build kubelet etc. N times over on every push, which is expensive

Member


@BenTheElder good points. Maybe we can add a regex to limit how often this runs? (limit to files in kubelet that affect the CRI stuff?)

/hold

Member

@BenTheElder BenTheElder Mar 11, 2021


For context, a typical Kubernetes PR already has 13+ always_run jobs, but we've been eliminating them where possible (#18729).
Not all of these do an e2e build, but at least 6 of them do. I'm already looking to see what we can kick out, and this doesn't include some additional run_if_changed jobs we trigger on a lot of PRs now.

Contributor Author


Since dockershim is getting deprecated, only containerd will be used for node e2e tests. We want to make sure more than one CRI runtime is tested in node e2e tests.

Member


That's fair; that does not, however, mean both need to be tested on every single push to Kubernetes versus in periodic/postsubmit (as discussed in #sig-node). If we tried to test the matrix of C*I on every push, it would get out of hand.

If we need this in presubmit because it is constantly broken in CI without it, I think it makes more sense to rework node e2e to allow a single job with N VMs (ideally in parallel) that do not necessarily all have the same runtime, so we can add fedora/cri-o/cgroupsv2 as additional VMs with the same build, runner, etc. (with lower overhead than copy-pasting the entire job and tweaking a field)

Contributor Author


My first idea was to do just that: modify the existing pull-kubernetes-node-e2e to add more VMs for fedora-crio-cgroupv1 (and, in the future, fedora-crio-cgroupv2). That way we don't end up adding one more job. But --node-test-args for docker is different than for crio.

I will explore whether we could set the required kubelet args in https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/image-config.yaml#L11 using the metadata field.

Also, considering dockershim is getting deprecated, we will have to run node e2e jobs using the remote runtime type eventually anyway. So this could be a replacement for the existing node e2e test, which only uses docker.
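A sketch of the idea being explored: carry the runtime-specific kubelet/test arguments alongside each VM image in image-config.yaml. Whether the metadata field can actually be used this way is exactly the open question here; the entry below is hypothetical, including the image and project names.

```yaml
# Hypothetical image-config.yaml entry; the image/project names and the
# use of metadata for runtime endpoints are assumptions, not the real schema.
images:
  fedora-crio-cgroupv1:
    image: fedora-33-crio            # assumed image name
    project: cri-o-community-images  # assumed GCP project
    machine: n1-standard-2
    metadata: "container-runtime=remote,container-runtime-endpoint=unix:///var/run/crio/crio.sock"
```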

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 11, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@BenTheElder
Member

The gubernator failure is still real, but removing my hold after discussing with @ehashman @harche.
I need to run for a bit, so I'll leave it at that, but the intent is to post back a summary as well 😅
/hold cancel

@ehashman
Member

/retest

@ehashman
Member

will leave it to @harche to fix the failure :)

@ehashman
Member

ehashman commented Mar 11, 2021

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@BenTheElder
Member

BenTheElder commented Mar 12, 2021

/lgtm cancel

this is not a flake #21278 (comment), the failure is hidden in the full log (Expand the log lines)

Gubernator configuration file is out of sync!

Run gubernator/update_config.sh

(cancel LGTM => no more auto-retest, and it will need a re-lgtm when the fix is pushed anyhow)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added area/gubernator Issues or PRs related to code in /gubernator and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 12, 2021
@harche
Contributor Author

harche commented Mar 12, 2021

/lgtm cancel

this is not a flake #21278 (comment), the failure is hidden in the full log (Expand the log lines)

Gubernator configuration file is out of sync!

Run gubernator/update_config.sh

(cancel LGTM => no more auto-retest, and it will need a re-lgtm when the fix is pushed anyhow)

Thanks, fixed it.

Member

@ehashman ehashman left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2021
@ehashman
Member

/assign @spiffxp

PTAL :)

Contributor

@spiffxp spiffxp left a comment


/approve
/lgtm
I'm good with this given the understanding reached in #21278 (review) (extra job for now, fold back into node-e2e later)

  testgrid-tab-name: pr-crio-gce-e2e
- always_run: false
+ always_run: true
  skip_report: true
Contributor


I would be OK with this being false; you might find it's a pain to bounce from PR to job, but that can be a followup if you care.

Member


Once we get some signal we plan on reporting :)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, ehashman, harche, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit 927be10 into kubernetes:master Mar 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 17, 2021
@k8s-ci-robot
Contributor

@harche: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

This PR will enable the pull-kubernetes-node-crio-e2e presubmit job to run against every k8s PR on the master branch, but it will be an optional job.

pull-kubernetes-node-crio-e2e runs NodeConformance and NodeFeatures node e2e tests using cgroup v1.

Signed-off-by: Harshal Patil [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@harche harche deleted the always_run branch March 25, 2021 13:35
@ehashman
Member

@harche are we ready to promote this to reporting/blocking?

@harche
Contributor Author

harche commented May 18, 2021

@ehashman This job was pretty stable, but the latest update to runc has broken it: kubernetes/kubernetes#102092. Once a hotfix goes into runc and makes it into the k8s vendored dependencies, we can consider making this reporting/blocking.

The other issue is that we were looking into merging this crio presubmit job with the existing docker based node conformance job. The existing blocking node e2e conformance job passes node args that are specific to docker, but any remote runtime, such as crio, will require a slightly different set of node args. @SergeyKanzhelev @MHBauer I was wondering if there is a way to pass the node test args via the image config file?

@ehashman if passing node test args via the image config file is not feasible, then we might have to consider making the existing job reporting/blocking instead of merging it with the docker based job. Making the existing job reporting/blocking is a fairly simple task; we just have to flip the boolean.

@ehashman
Member

@ehashman This job was pretty stable, but the latest update to runc has broken it: kubernetes/kubernetes#102092. Once a hotfix goes into runc and makes it into the k8s vendored dependencies, we can consider making this reporting/blocking.

Okay, once fixed let's make this reporting.

The other issue is that we were looking into merging this crio presubmit job with the existing docker based node conformance job. The existing blocking node e2e conformance job passes node args that are specific to docker, but any remote runtime, such as crio, will require a slightly different set of node args. @SergeyKanzhelev @MHBauer I was wondering if there is a way to pass the node test args via the image config file?

My understanding is that this was not a blocker to making this a blocking job; it's just a long-term plan.

Labels

  • approved
  • area/config
  • area/gubernator
  • area/jobs
  • cncf-cla: yes
  • lgtm
  • sig/node
  • sig/testing
  • size/XS

8 participants