exec probes that exceed timeout always succeed when using ExecProbeTimeout=false #100198

@jackfrancis

Description

What happened:

We've observed that, when the kubelet is configured with the feature gate ExecProbeTimeout=false and uses dockershim (e.g., a docker/moby CRI), exec probes have no effect if their command takes longer to run than the probe's configured timeoutSeconds.

This differs from the buggy dockershim probe timeout behavior prior to the 1.20 release. That earlier behavior, which ExecProbeTimeout=false is meant to reproduce in 1.20 and beyond, simply ignored timeouts: probes were allowed "unlimited" time to complete, and the eventual exit code of the probe command was respected.

What we are observing now: instead of the timeout being ignored entirely when ExecProbeTimeout=false is set, a probe that exceeds the timeout effectively short-circuits the context between the probe and the container runtime, so probes seem to "always succeed". What's really happening is that once the probe's running time exceeds the configured timeout (1 second by default), its outcome is ignored. The container continues running.

What you expected to happen:

I would expect a probe command that takes longer than the configured timeout to remain in effect: a non-zero exit code from the probe command should result in the container being killed and its restart count being incremented by 1.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Kubernetes cluster (>= 1.20.0) with a kubelet feature gate configuration that includes ExecProbeTimeout=false. (We tested in Azure using a docker CRI.)
  2. Launch a container that runs forever, for example a simple busybox Pod spec with a command like while true; do sleep 5; done. Include an exec livenessProbe that always returns non-zero, like sleep 10 & exit 1. With the default 1 second timeout, such a probe command always exceeds the timeout, and the ExecProbeTimeout=false kubelet feature gate config should ensure that the timeout is ignored.
  3. Observe that the busybox container running the above spec does not restart; thus, the probe failure is not being respected.
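
The steps above can be sketched with two manifests. This is a minimal sketch, not the exact configuration used in our testing. It assumes the kubelet is configured through a KubeletConfiguration file (the same gate can also be passed with the kubelet's --feature-gates flag). The Pod name, container name, and image are hypothetical placeholders; only the container command and the probe command come from the steps above.

```yaml
# Kubelet configuration fragment: disable exec probe timeout enforcement.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ExecProbeTimeout: false
```

```yaml
# Pod that runs forever and has an exec liveness probe that always fails.
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-timeout-repro   # hypothetical name
spec:
  containers:
  - name: busybox                  # hypothetical name
    image: busybox                 # assumed image; any image with /bin/sh works
    command: ["/bin/sh", "-c", "while true; do sleep 5; done"]
    livenessProbe:
      exec:
        # Exits non-zero; the backgrounded sleep is what keeps the probe
        # running past the timeout, as described in step 2.
        command: ["/bin/sh", "-c", "sleep 10 & exit 1"]
      timeoutSeconds: 1            # the default, written out for clarity
```

With the expected behavior, the container's restart count in kubectl get pod output should climb over time; with the behavior reported here, it stays at 0.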

Anything else we need to know?:

@chewong created a simple test case that proves this here:

chewong@e3632dd

Environment:

  • Kubernetes version (use kubectl version): 1.20.0 and up
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Labels: kind/bug, needs-triage, sig/node