Conversation

@andrewsykim
Member

@andrewsykim andrewsykim commented Nov 12, 2020

Signed-off-by: Andrew Sy Kim [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:
In #94115 we fixed a bug where kubelet did not respect exec probe timeouts. That PR also re-enabled some tests to ensure we don't regress. So far all the tests using containerd have been passing, but some tests using dockershim started to fail (see #96463). This wasn't caught during presubmit since jobs that run dockershim only run e2es marked with [NodeConformance].

The bug is in dockershim's CRI implementation: it should return context.DeadlineExceeded in the ExecSync response instead of exec.TimedoutError. TimedoutError is the error the prober expects, and it should be produced by the gRPC client, not the server. The server is expected to return context.DeadlineExceeded, which results in the CRI client returning TimedoutError back to the prober.

Which issue(s) this PR fixes:
Fixes #96463

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 12, 2020
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 12, 2020
@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch from 5b79be9 to 5a1b531 Compare November 12, 2020 03:34
@andrewsykim
Member Author

/assign @SergeyKanzhelev

@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch 2 times, most recently from 3517e5f to 9a03598 Compare November 12, 2020 03:46
Contributor

@hasheddan hasheddan left a comment

@andrewsykim looks like this may break some of the other probing container tests

@andrewsykim
Member Author

@hasheddan that's interesting, I thought none of the presubmits run docker (see #96463 (comment)), and this PR only touches dockershim.

@hasheddan
Contributor

@andrewsykim so it looks like the presubmit does use dockershim, but our periodic does not. From the failing test logs:

NodeSystemInfo{MachineID:3733c809a1e5b150506bec770a5c5c70,SystemUUID:3733c809-a1e5-b150-506b-ec770a5c5c70,BootID:4cd37059-30d4-4a94-86d5-9a1c18597431,KernelVersion:5.3.0-1016-gke,OSImage:Ubuntu 18.04.4 LTS,ContainerRuntimeVersion:docker://19.3.2,KubeletVersion:v1.20.0-beta.1.466+011d0e2ffaf85f,KubeProxyVersion:v1.20.0-beta.1.466+011d0e2ffaf85f,OperatingSystem:linux,Architecture:amd64,}

I know @oomichi tested out that the test did in fact fail without your PR but it wasn't pull-kubernetes-node-e2e, it was pull-kubernetes-e2e-gce-ubuntu-containerd (see #96178). I am looking at differences between the pull-kubernetes-node-e2e presubmit and the gce-ubuntu-master-default periodic now. I do notice that we are using different Ubuntu versions in each:

  • presubmit: ubuntu-gke-1804-1-16-v20200330
  • periodic: ubuntu-2004-focal-v20200423

@hasheddan
Contributor

@andrewsykim ah found it, we are skipping the test that is failing in the periodic

@hasheddan
Contributor

xref kubernetes/test-infra#19934

@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch 3 times, most recently from ed47ee3 to ab34ee5 Compare November 13, 2020 00:20
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 13, 2020
@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch from ab34ee5 to 56fd817 Compare November 13, 2020 01:54
@andrewsykim
Member Author

/milestone v1.20

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 16, 2020
Member

nothing bad will happen, but maybe need to call cancel()?

Member Author

cancel is called with defer on line 112

Member

yes, totally. My comment is that we can cancel timeout in case the exec has already finished =)

@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch from bbdcd20 to 3391546 Compare November 17, 2020 03:52
@SergeyKanzhelev
Member

/test pull-kubernetes-e2e-azure-disk-windows

Member

@derekwaynecarr derekwaynecarr left a comment

just the one comment about not adding NodeConformance tag.

Member

i agree we should defer labeling this NodeConformance

@andrewsykim andrewsykim force-pushed the dockershim-exec-context branch from 3391546 to f5a82f7 Compare November 17, 2020 15:02
@andrewsykim
Member Author

@derekwaynecarr removed exec probe timeout tests from [NodeConformance], PTAL

@SergeyKanzhelev
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2020
@andrewsykim
Member Author

/kind failing-test
/priority critical-urgent
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2020
Contributor

@hasheddan hasheddan left a comment

/lgtm

@derekwaynecarr could you take another look here and approve if looks good to you?

@derekwaynecarr
Member

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, derekwaynecarr, hasheddan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Development

Successfully merging this pull request may close these issues.

[Failing Test] [k8s.io] Probing container should be restarted with a docker exec liveness probe with timeout [ci-kubernetes-e2e-ubuntu-gce]

7 participants