
Presubmit cgroup v1 crio NodeConformance and NodeFeatures #21278

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from harche:always_run on Mar 17, 2021

Conversation

@harche
Contributor

@harche harche commented Mar 10, 2021

This PR will enable the pull-kubernetes-node-crio-e2e presubmit job to run against every k8s PR on the master branch, but it will be an optional job.

pull-kubernetes-node-crio-e2e runs NodeConformance and NodeFeatures node e2e tests using cgroup v1.

Signed-off-by: Harshal Patil [email protected]
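For reference, a Prow presubmit entry of the shape described above looks roughly like the following. This is an illustrative sketch, not the exact merged configuration; values such as `decorate` and the annotation placement are assumptions (the testgrid annotations themselves appear later in this review).

```yaml
# Illustrative sketch of a Prow presubmit job entry; values are
# assumptions and do not reproduce the exact merged config.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-node-crio-e2e
    always_run: true   # triggered on every PR push
    optional: true     # job failure does not block merging
    branches:
    - master
    decorate: true
    annotations:
      testgrid-dashboards: sig-node-cri-o
      testgrid-tab-name: pr-crio-gce-e2e
```

`optional: true` keeps the job from gating merges, while `skip_report: true` (added during this review) additionally hides its status from the GitHub UI.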

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 10, 2021
@k8s-ci-robot k8s-ci-robot requested review from MHBauer and bart0sh March 10, 2021 07:12
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Mar 10, 2021
@harche
Contributor Author

harche commented Mar 10, 2021

/cc @dims @sjenning

@k8s-ci-robot k8s-ci-robot requested review from dims and sjenning March 10, 2021 07:13
@harche
Contributor Author

harche commented Mar 10, 2021

/hold

We seem to be hitting a small issue in CRI-O CI while downloading the static binaries.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2021
@harche
Contributor Author

harche commented Mar 10, 2021

I will remove the hold after cri-o/cri-o#4642 is merged.

Member

@dims dims left a comment


please add skip_report: true so it does not show up in the github UI. let's run it for a few weeks like that and then remove it.

also, is there an equivalent CI job already? if not, let's please add one.

@harche
Contributor Author

harche commented Mar 10, 2021

please add skip_report: true so it does not show up in the github UI. let's run it for a few weeks like that and then remove it.

also, is there an equivalent CI job already? if not, let's please add one.

Thanks @dims. I added skip_report: true.

We do have equivalent periodic jobs, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*

  • ci-crio-cgroupv1-node-e2e-features runs NodeFeatures e2e tests every hour using k8s master
  • ci-crio-cgroupv1-node-e2e-conformance runs NodeConformance e2e tests every hour using k8s master
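For context, an hourly periodic job of that shape is declared roughly like this in Prow. This is a sketch; only the job name and the hourly cadence come from the discussion above, the remaining fields are assumptions.

```yaml
# Illustrative sketch of an hourly periodic Prow job; only the name and
# cadence come from the discussion above, the rest is assumed.
periodics:
- name: ci-crio-cgroupv1-node-e2e-conformance
  interval: 1h          # "every hour using k8s master"
  decorate: true
  extra_refs:           # periodics check out the repo explicitly
  - org: kubernetes
    repo: kubernetes
    base_ref: master
```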

Unfortunately those tests have been failing for the last few hours due to an issue in CRI-O CI, but the fix is on the way: cri-o/cri-o#4642. In fact, I have put a /hold on this PR for the same reason.

@dims
Member

dims commented Mar 10, 2021

/approve

will lgtm once the ci jobs are green

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

I will remove the hold after cri-o/cri-o#4642 is merged.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

/approve

will lgtm once the ci jobs are green

@dims the CI jobs are back to green again, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*

@harche
Contributor Author

harche commented Mar 11, 2021

I didn't quite get why pull-test-infra-bazel is failing, any clue? @SergeyKanzhelev @dims

@dims
Member

dims commented Mar 11, 2021

/retest
/lgtm

🤞🏾 that this is a flake!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@harche
Contributor Author

harche commented Mar 11, 2021

/retest

@BenTheElder
Member

FAIL: TestPresubmits (0.03s)
jobs_test.go:93: Job pull-kubernetes-node-crio-e2e: Cannot have both branches and skip_branches set

  testgrid-dashboards: sig-node-cri-o
  testgrid-tab-name: pr-crio-gce-e2e
- always_run: false
+ always_run: true
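The TestPresubmits failure above is Prow's config validation: a job's branch filter may use `branches` or `skip_branches`, but not both at once. Roughly, using hypothetical fragments for illustration (the skip_branches value below is invented):

```yaml
# Rejected by Prow config validation: both branch filters set on one job.
# - name: pull-kubernetes-node-crio-e2e
#   branches: ["master"]
#   skip_branches: ["release-1.20"]   # hypothetical value

# Accepted: exactly one branch filter.
- name: pull-kubernetes-node-crio-e2e
  branches:
  - master
```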
Member


@dims do we really need another always_run? We're trying to reduce the number of presubmits we rely on because:

  • they add noise to contributors when flaking
  • presubmits cannot consume a shared build, unlike periodics after merge; we build kubelet etc. N times over on every push, which is expensive

Member


@BenTheElder good points. Maybe we can add a regex to limit how often this runs? (limit to files in kubelet that affect the CRI stuff?)

/hold

Member

@BenTheElder BenTheElder Mar 11, 2021


For context, a typical Kubernetes PR already has 13+ always_run jobs, but we've been eliminating them where possible (#18729).
Not all of these do an e2e build, but at least 6 of them do. I'm already looking to see what we can kick out, and this doesn't include some additional run_if_changed jobs we trigger on a lot of PRs now.

Contributor Author


Since dockershim is getting deprecated, only containerd will be used for node e2e tests. We want to make sure more than one CRI runtime is tested in node e2e tests.

Member


That's fair; that does not, however, mean both need to be tested on every single push to Kubernetes versus in periodic/postsubmit (as discussed in #sig-node). If we tried to test the matrix of C*I on every push, it would get out of hand.

If we need this in presubmit because it is constantly broken in CI without it, I think it makes more sense to rework node e2e to allow a single job with N VMs (ideally in parallel) that do not necessarily all have the same runtime, so we can add fedora/cri-o/cgroupsv2 as additional VMs with the same build, runner, etc. (with lower overhead than copy-pasting the entire job and tweaking a field)

Contributor Author


My first idea was to do just that: modify the existing pull-kubernetes-node-e2e to add more VMs for fedora-crio-cgroupv1 (and, in the future, fedora-crio-cgroupv2). That way we don't end up adding one more job. But --node-test-args for docker is different than for crio.

I will explore whether we could set the required kubelet args in https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/image-config.yaml#L11 using the metadata field.

Also, considering dockershim is getting deprecated, we will have to run node e2e jobs using the remote runtime type eventually anyway. So this could be a replacement for the existing node e2e test, which only uses docker.
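A sketch of the idea being explored: carry the runtime-specific kubelet/test arguments alongside each VM image in image-config.yaml. Whether the metadata field can actually be used this way is exactly the open question here; the entry below is hypothetical, including the image and project names.

```yaml
# Hypothetical image-config.yaml entry; the image/project names and the
# use of metadata for runtime endpoints are assumptions, not the real schema.
images:
  fedora-crio-cgroupv1:
    image: fedora-33-crio            # assumed image name
    project: cri-o-community-images  # assumed GCP project
    machine: n1-standard-2
    metadata: "container-runtime=remote,container-runtime-endpoint=unix:///var/run/crio/crio.sock"
```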

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 11, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 11, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2021
@BenTheElder
Member

The gubernator failure is still real, but removing my hold after discussing with @ehashman @harche.
I need to run for a bit, so I'll leave it at that, but the intent is to post back a summary as well 😅
/hold cancel

@ehashman
Member

/retest

@ehashman
Member

will leave it to @harche to fix the failure :)

@ehashman
Member

ehashman commented Mar 11, 2021

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@BenTheElder
Member

BenTheElder commented Mar 12, 2021

/lgtm cancel

this is not a flake #21278 (comment), the failure is hidden in the full log (Expand the log lines)

Gubernator configuration file is out of sync!

Run gubernator/update_config.sh

(cancel LGTM => no more auto-retest, and it will need a re-lgtm when the fix is pushed anyhow)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added area/gubernator Issues or PRs related to code in /gubernator and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 12, 2021
@harche
Contributor Author

harche commented Mar 12, 2021

/lgtm cancel

this is not a flake #21278 (comment), the failure is hidden in the full log (Expand the log lines)

Gubernator configuration file is out of sync!

Run gubernator/update_config.sh

(cancel LGTM => no more auto-retest, and it will need a re-lgtm when the fix is pushed anyhow)

Thanks, fixed it.

Member

@ehashman ehashman left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2021
@ehashman
Member

/assign @spiffxp

PTAL :)

Contributor

@spiffxp spiffxp left a comment


/approve
/lgtm
I'm good with this given the understanding reached in #21278 (review) (extra job for now, fold back into node-e2e later)

  testgrid-tab-name: pr-crio-gce-e2e
- always_run: false
+ always_run: true
  skip_report: true
Contributor


I would be OK with this being false; you might find it's a pain to bounce from PR to job, but that can be a followup if you care.

Member


Once we get some signal we plan on reporting :)

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, ehashman, harche, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit 927be10 into kubernetes:master Mar 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 17, 2021
@k8s-ci-robot
Contributor

@harche: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

This PR will enable the pull-kubernetes-node-crio-e2e presubmit job to run against every k8s PR on the master branch, but it will be an optional job.

pull-kubernetes-node-crio-e2e runs NodeConformance and NodeFeatures node e2e tests using cgroup v1.

Signed-off-by: Harshal Patil [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@harche harche deleted the always_run branch March 25, 2021 13:35
@ehashman
Member

@harche are we ready to promote this to reporting/blocking?

@harche
Contributor Author

harche commented May 18, 2021

@ehashman This job was pretty stable, but the latest update to runc has broken it: kubernetes/kubernetes#102092. Once a hotfix goes into runc and makes it into the k8s vendored dependencies, we can consider making this reporting/blocking.

The other issue is that we were looking into merging this crio presubmit job with the existing docker based node conformance job. The existing blocking node e2e conformance job passes node args that are specific to docker, but any remote runtime, such as crio, will require a slightly different set of node args. @SergeyKanzhelev @MHBauer I was wondering if there is a way to pass the node test args via the image config file?

@ehashman if passing node test args via the image config file is not feasible, then we might have to consider making the existing job reporting/blocking instead of merging it with the docker based job. Making the existing job reporting/blocking is a fairly simple task; we just have to flip the boolean.

@ehashman
Member

@ehashman This job was pretty stable, but the latest update to runc has broken it: kubernetes/kubernetes#102092. Once a hotfix goes into runc and makes it into the k8s vendored dependencies, we can consider making this reporting/blocking.

Okay, once fixed let's make this reporting.

The other issue is that we were looking into merging this crio presubmit job with the existing docker based node conformance job. The existing blocking node e2e conformance job passes node args that are specific to docker, but any remote runtime, such as crio, will require a slightly different set of node args. @SergeyKanzhelev @MHBauer I was wondering if there is a way to pass the node test args via the image config file?

My understanding is that this was not a blocker to making this a blocking job; it's just a long-term plan.

Labels

  • approved
  • area/config
  • area/gubernator
  • area/jobs
  • cncf-cla: yes
  • lgtm
  • sig/node
  • sig/testing
  • size/XS

8 participants