Presubmit cgroup v1 crio NodeConformance and NodeFeatures #21278
k8s-ci-robot merged 1 commit into kubernetes:master from harche:always_run
Conversation
/hold we seem to be hitting a small issue in CRIO CI while downloading the static binaries.

I will remove the hold after cri-o/cri-o#4642 is merged.
dims
left a comment
please add skip_report: true so it does not show up in the GitHub UI. let's run it for a few weeks like that and then remove this.
also, is there an equivalent CI job already? if not let's please add one.
Thanks @dims. I added skip_report: true. We do have equivalent periodic jobs, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*
Unfortunately those tests have been failing for the last few hours due to an issue in CRIO CI. But the fix is on the way, cri-o/cri-o#4642. In fact, I have put a hold on this PR until it is merged.
/approve will lgtm once the CI jobs are green

/hold cancel

@dims the CI jobs are back to green again, https://prow.k8s.io/?job=*ci-crio-cgroupv1-node-e2e-*
I didn't quite get why

/retest 🤞🏾 that this is a flake!

/retest
```diff
   testgrid-dashboards: sig-node-cri-o
   testgrid-tab-name: pr-crio-gce-e2e
-  always_run: false
+  always_run: true
```
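For context, the fields in the diff above sit inside a Prow presubmit definition; a rough sketch of the surrounding stanza (field values taken from this PR, the rest abbreviated and assumed):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-node-crio-e2e
    always_run: true          # was false; now triggered on every PR push
    annotations:
      testgrid-dashboards: sig-node-cri-o
      testgrid-tab-name: pr-crio-gce-e2e
    # decorate, labels, spec, etc. omitted
```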
@dims do we really need another always_run? We're trying to reduce the number of presubmits we rely on because:
- they add noise to contributors when flaking
- presubmits cannot consume a shared build, unlike periodics after merge; we build kubelet etc. N times over on every push, which is expensive.
@BenTheElder good points. Maybe we can add a regex to limit how many times this runs? (limit to files in kubelet that affect the CRI stuff?)
/hold
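One way to express that regex idea in the Prow job config is a run_if_changed filter replacing always_run; a sketch only, with illustrative (not vetted) paths:

```yaml
- name: pull-kubernetes-node-crio-e2e
  # Hypothetical path filter: trigger only when CRI-adjacent code changes.
  run_if_changed: '^(pkg/kubelet/|test/e2e_node/|staging/src/k8s.io/cri-api/)'
  # rest of the job spec unchanged
```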
For context, a typical Kubernetes PR already has 13+ always_run jobs. But we've been eliminating them where possible #18729
Not all of these do an e2e build, but at least 6 of them do. I'm already looking to see what we can kick out, and this doesn't include some additional run_if_changed jobs we trigger on a lot of PRs now.
Since the dockershim is getting deprecated, that means only containerd will be used for node e2e tests. We want to make sure more than one CRI runtime is tested for node e2e tests.
That's fair; that does not, however, mean both need to be tested on every single push to Kubernetes versus in periodic / postsubmit (as discussed in #sig-node). If we tried to test the matrix of C*I on every push it would get out of hand.
If we need this in presubmit because it is constantly broken in CI without it, I think it makes more sense to rework node e2e to allow a single job with N VMs (ideally in parallel) that do not necessarily all have the same runtime, so we can add fedora/cri-o/cgroupsv2 as additional VMs with the same build, runner, etc. (with lower overhead than copying and pasting the entire job and tweaking a field)
My first idea was to do just that: modify the existing pull-kubernetes-node-e2e to add more VMs for fedora-crio-cgroupv1 (and in future, fedora-crio-cgroupv2). That way we don't end up adding one more job. But --node-test-args for docker is different from crio's.
I will explore if we could set the required kubelet args in https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/image-config.yaml#L11 using the metadata field.
Also, considering dockershim is getting deprecated, we will have to run node e2e jobs using the remote runtime type eventually anyway. So this could be a replacement for the existing node e2e test, which only uses docker.
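If the metadata idea works out, the image config might look something like this (hypothetical sketch; the image name is assumed, and carrying node test args in the metadata field is an idea under exploration, not the current schema):

```yaml
images:
  fedora-crio-cgroupv1:
    image: fedora-coreos-stable       # assumed image name
    project: fedora-coreos-cloud
    # Hypothetical: per-image runtime flags the runner would pass through
    # to the kubelet, so docker and cri-o VMs can share one job.
    metadata: "container-runtime=remote,container-runtime-endpoint=unix:///var/run/crio/crio.sock"
```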
/retest

will leave it to @harche to fix the failure :)

Also added to next week's agenda for the CI subproject https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.juhhlvrt46z7 Previously discussed in the 2020-03-31 and 2020-06-16 meetings https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.d9zp2j5jvkke

/retest
/lgtm cancel this is not a flake #21278 (comment), the failure is hidden in the full log (Expand the log lines)
(cancel LGTM => no more auto-retest, and it will need a re-lgtm when the fix is pushed anyhow)
Signed-off-by: Harshal Patil <[email protected]>
Thanks, fixed it.

/assign @spiffxp PTAL :)
spiffxp
left a comment
/approve
/lgtm
I'm good with this given the understanding reached in #21278 (review) (extra job for now, fold back into node-e2e later)
```diff
   testgrid-tab-name: pr-crio-gce-e2e
-  always_run: false
+  always_run: true
+  skip_report: true
```
I would be OK with this being false; you might find it's a pain to bounce from PR to job, but that can be a follow-up if you care.
Once we get some signal we plan on reporting :)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dims, ehashman, harche, spiffxp
@harche are we ready to promote this to reporting/blocking?
@ehashman This job was pretty stable, but the latest update to runc has broken it, kubernetes/kubernetes#102092. So once a hotfix goes into runc and makes it into the k8s vendor, we can consider making this reporting/blocking.

The other issue is, we were looking into merging this crio presubmit job with the existing docker based node conformance job. But the existing blocking node e2e conformance test job passes node args that are specific to docker, while any remote runtime, such as crio, will require a slightly different set of node args. @SergeyKanzhelev @MHBauer I was wondering if there is a way to pass the node test args via the image config file?

@ehashman if passing node test args via the image config file is not feasible, then we might have to consider making the existing job reporting/blocking instead of merging it with the existing docker based job. Making the existing job reporting/blocking is a fairly simple task; we just have to flip the boolean.
Okay, once fixed let's make this reporting.
My understanding is that this was not a blocker to making this a blocking job, it's just a long-term plan.
This PR will enable the pull-kubernetes-node-crio-e2e presubmit job to run against every k8s PR on the master branch, but it will be an optional job. pull-kubernetes-node-crio-e2e runs NodeConformance and NodeFeatures node e2e tests using cgroup v1.

Signed-off-by: Harshal Patil [email protected]