Description
What happened:
When terminationGracePeriodSeconds for a Pod is higher than 9999999999999999, the Pod hangs forever in the Terminating state. This may lead to the node becoming NotReady.
Kubelet logs:
Oct 24 10:57:33 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:33.969080 4169 kubelet.go:1553] error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Oct 24 10:57:33 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:33.969118 4169 pod_workers.go:190] Error syncing pod 08e54efc-f64d-11e9-b793-3aec29032c39 ("foo_default(08e54efc-f64d-11e9-b793-3aec29032c39)"), skipping: error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: I1024 10:57:34.689620 4169 kubelet.go:1932] SyncLoop (PLEG): "foo_default(08e54efc-f64d-11e9-b793-3aec29032c39)", event: &pleg.PodLifecycleEvent{ID:"08e54efc-f64d-11e9-b793-3aec29032c39", Type:"ContainerDied", Data:"0b5f8ccd31a6e3f950abfde25ec378e9fe48da12ca04425b622b550d94047951"}
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: W1024 10:57:34.689743 4169 pod_container_deletor.go:75] Container "0b5f8ccd31a6e3f950abfde25ec378e9fe48da12ca04425b622b550d94047951" not found in pod's containers
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: I1024 10:57:34.689957 4169 kuberuntime_container.go:581] Killing container "docker://69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" with 99999999999999999 second grace period
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:34.690039 4169 remote_runtime.go:250] StopContainer "69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:34.690065 4169 kuberuntime_container.go:585] Container "docker://69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" termination failed with gracePeriod 99999999999999999: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:34.691582 4169 kubelet.go:1553] error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Oct 24 10:57:34 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:34.691603 4169 pod_workers.go:190] Error syncing pod 08e54efc-f64d-11e9-b793-3aec29032c39 ("foo_default(08e54efc-f64d-11e9-b793-3aec29032c39)"), skipping: error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Oct 24 10:57:47 aks-pool1-19572124-vmss000000 kubelet[4169]: I1024 10:57:47.530264 4169 kuberuntime_container.go:581] Killing container "docker://69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" with 99999999999999999 second grace period
Oct 24 10:57:47 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:47.530346 4169 remote_runtime.go:250] StopContainer "69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Oct 24 10:57:47 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:47.530390 4169 kuberuntime_container.go:585] Container "docker://69554aafdc256a3cda591cbcc942f0c35a222672311a3d725da15bb4939fd0d7" termination failed with gracePeriod 99999999999999999: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Oct 24 10:57:47 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:47.532310 4169 kubelet.go:1553] error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Oct 24 10:57:47 aks-pool1-19572124-vmss000000 kubelet[4169]: E1024 10:57:47.532337 4169 pod_workers.go:190] Error syncing pod 08e54efc-f64d-11e9-b793-3aec29032c39 ("foo_default(08e54efc-f64d-11e9-b793-3aec29032c39)"), skipping: error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
What you expected to happen:
Pods should not pile up on the node.
How to reproduce it (as minimally and precisely as possible):
Create the following deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: foo
      labels:
        app: nginx
    spec:
      # Working
      #
      #terminationGracePeriodSeconds: 99999999999999
      #terminationGracePeriodSeconds: 999999999999999
      #terminationGracePeriodSeconds: 9999999999999999
      # Not working. Gives following error message:
      # skipping: error killing pod: failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
      #
      # Remove with 'kubectl delete pods --force --grace-period=0 foo'
      #
      terminationGracePeriodSeconds: 99999999999999999
      #terminationGracePeriodSeconds: 999999999999999999
      #terminationGracePeriodSeconds: 1000000000000000000
      containers:
      - name: nginx
        image: nginx
Once deployed, run the following command:
while true; do kubectl delete --wait=false pods $(kubectl get pods | grep Running | awk '{print $1}' | tr \\n ' '); sleep 5; done
Alternatively, Pod re-creation can be triggered by scaling a deployment up and down:
k scale deployment nginx --replicas=5; sleep 5; k scale deployment nginx --replicas=0
Anything else we need to know?:
It seems that terminationGracePeriodSeconds is taken as an int64 and then converted to time.Duration here, which overflows if you pass a value bigger than roughly 292 years, since time.Duration stores nanoseconds in an int64.
The pods will pile up until you hit the maxPods limit on the node or run out of memory, as the main container process (nginx) is never killed (the pause container is removed, though).
I think ideally terminationGracePeriodSeconds should be replaced with terminationGracePeriod, a string field that could be parsed with time.ParseDuration, which returns an error if the value overflows. Example here: https://play.golang.org/p/_844zxthbRp
As a workaround, a hard limit could be set on this field to reject such values.
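A hard limit could look roughly like the sketch below. The names validateGracePeriodSeconds and maxGracePeriodSeconds are hypothetical, and the limit chosen here is simply the largest value that survives conversion to nanoseconds; a real API might well pick a much smaller bound.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// maxGracePeriodSeconds is the largest whole number of seconds whose
// nanosecond representation still fits in an int64 (9223372036).
// Using it as the limit is an assumption made for this sketch.
const maxGracePeriodSeconds = int64(math.MaxInt64 / int64(time.Second))

// validateGracePeriodSeconds is a hypothetical validation step that
// rejects values which would overflow when converted to time.Duration.
func validateGracePeriodSeconds(seconds int64) error {
	if seconds < 0 {
		return fmt.Errorf("grace period must not be negative, got %d", seconds)
	}
	if seconds > maxGracePeriodSeconds {
		return fmt.Errorf("grace period %d exceeds maximum of %d seconds",
			seconds, maxGracePeriodSeconds)
	}
	return nil
}

func main() {
	for _, s := range []int64{30, 9223372036, 9223372037, 99999999999999999} {
		if err := validateGracePeriodSeconds(s); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", time.Duration(s)*time.Second)
	}
}
```

Rejecting the value up front means the later conversion to time.Duration can no longer wrap.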
There is also a related Go issue about unsafe time arithmetic: golang/go#20757
It looks like the Pods can be removed using kubectl delete pods --force --grace-period=0, after which the Docker containers are removed as well.
Environment:
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.7", GitCommit:"8fca2ec50a6133511b771a11559e24191b1aa2b4", GitTreeState:"clean", BuildDate:"2019-09-18T14:39:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
$ cat /etc/os-release
NAME="Flatcar Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=2296.99.0
VERSION_ID=2296.99.0
BUILD_ID=2019-10-22-2150
PRETTY_NAME="Flatcar Linux by Kinvolk 2296.99.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar-linux.org/"
BUG_REPORT_URL="https://issues.flatcar-linux.org"
FLATCAR_BOARD="amd64-usr"
- Kernel (e.g. uname -a): Linux controller-testing-1 5.3.7-flatcar #1 SMP Tue Oct 22 21:04:50 -00 2019 x86_64 Intel Xeon Processor (Skylake, IBRS) GenuineIntel GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
- Docker version:
Client:
Version: 19.03.2
API version: 1.40
Go version: go1.13.3
Git commit: 6a30dfc
Built: Thu Aug 29 04:42:14 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.12)
Go version: go1.13.3
Git commit: 6a30dfc
Built: Thu Aug 29 04:42:14 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.8
GitCommit: a4bc1d432a2c33aa2eed37f338dceabb93641310
runc:
Version: 1.0.0-rc9+dev.docker-19.03
GitCommit: d736ef14f0288d6993a1845745d6756cfc9ddd5a
docker-init:
Version: 0.18.0
GitCommit: fec3683b971d9c3ef73f284f176672c44b448662