Description
What happened:
We've observed that, when the kubelet runs with the ExecProbeTimeout=false feature gate and dockershim (e.g., docker/moby CRI), exec probes have no effect if their command takes longer to run than the configured probe timeoutSeconds.
This differs from the buggy dockershim probe timeout behavior prior to the release of 1.20. That previous behavior, which ExecProbeTimeout=false is meant to reproduce in 1.20 and beyond, simply ignored timeouts, allowing probes "unlimited" time to complete and respecting the eventual probe command exit code.
What we are observing now: instead of timeouts being entirely ignored when using ExecProbeTimeout=false, a probe that exceeds the timeout short-circuits the context between the probe and the container runtime entirely, so probes seem to "always succeed". What's really happening is that the outcome of the probe is ignored once its duration exceeds the configured timeout (1 second by default). The container continues running.
What you expected to happen:
I would expect a probe command that takes longer than the configured timeout to still be operative, a non-zero probe command exit code to result in the container being killed, and the container's restart count to be incremented by 1.
How to reproduce it (as minimally and precisely as possible):
- Create a Kubernetes cluster >= 1.20.0 with a kubelet feature gate configuration that includes `ExecProbeTimeout=false`. (We tested in Azure using a docker CRI.)
- Launch a container that runs forever, for example a simple `busybox` Pod spec with a command like `while true; do sleep 5; done`, and include an exec livenessProbe that does something that will always return non-zero, like `sleep 10 & exit 1`; the default 1 second timeout will ensure that a probe command such as that one always exceeds the timeout; and the `ExecProbeTimeout=false` kubelet feature gate config will ensure that that timeout is ignored.
- Observe that that busybox container running the above spec will not restart; thus, the probe failure is not being respected.
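For illustration, the Pod described in the steps above could look roughly like the following manifest. This is a minimal sketch rather than the exact spec we used: the Pod name `exec-probe-timeout-repro` is a hypothetical placeholder, the `busybox` image tag is left unpinned, and `timeoutSeconds` is written out only to make the default explicit. It assumes the kubelet on the node is already running with `ExecProbeTimeout=false`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-timeout-repro   # hypothetical name, chosen for this sketch
spec:
  containers:
  - name: busybox
    image: busybox
    # Main process: runs forever so the container never exits on its own.
    command: ["/bin/sh", "-c", "while true; do sleep 5; done"]
    livenessProbe:
      exec:
        # Probe command from the steps above: outlives the timeout and
        # returns a non-zero exit code.
        command: ["/bin/sh", "-c", "sleep 10 & exit 1"]
      timeoutSeconds: 1            # the default, stated explicitly
```

After applying the manifest, watching the Pod with `kubectl get pod exec-probe-timeout-repro -w` should show whether the RESTARTS column ever increases; in our case it did not.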
Anything else we need to know?:
@chewong created a simple test case that proves this here:
Environment:
- Kubernetes version (use `kubectl version`): 1.20.0 and up
- Cloud provider or hardware configuration: Azure
- OS (e.g: `cat /etc/os-release`):
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- Kernel (e.g. `uname -a`): Linux k8s-master-46627706-0 5.4.0-1040-azure #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: Cluster built w/ aks-engine
- Network plugin and version (if this is a network-related bug): n/a
- Others: