Description
What happened:
Pods are stuck in the Terminating state.
What you expected to happen:
Pods to be terminated after failing the readiness and liveness probes.
How to reproduce it (as minimally and precisely as possible):
- Create a Deployment.
- The kubelet fails to delete and recreate the pod after terminationGracePeriodSeconds: 300 has elapsed.
- We had to forcefully delete the pod after it got stuck in the Terminating state.
Anything else we need to know?:
Below is the deployment.yaml (I have redacted confidential info):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ABC
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ABC
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        checksum/config: xxxxxxxxxxxxxxx
      labels:
        app: ABC
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ABC
              namespaces: ["default"]
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: abc
        image: abc:latest
        workingDir: /work
        resources:
          requests:
            cpu: 0
            memory: 1Gi
          limits:
            cpu: 2
            memory: 1Gi
        readinessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 30
          periodSeconds: 5
        volumeMounts:
        - name: config
          mountPath: /abc
          subPath: config.conf
          readOnly: true
        - name: config
          mountPath: /abc
          subPath: config.json
          readOnly: true
        - name: archive
          mountPath: /archive
          subPath: /abc
      - name: telegraf
        image: telegraf:latest
        resources:
          requests:
            cpu: 0
            memory: 96Mi
          limits:
            cpu: 1
            memory: 96Mi
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config
        configMap:
          name: ABC
      - name: archive
        nfs:
          server: "archive-server"
          path: /
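For a sense of the timeline one would normally expect from this manifest, the probe settings can be combined with the documented Kubernetes defaults for unset probe fields (periodSeconds 10, timeoutSeconds 1, failureThreshold 3). The following is only a back-of-the-envelope sketch, not how the kubelet computes anything internally. The helper names are hypothetical, and the "failureThreshold times periodSeconds" window is an approximation.

```python
# Documented Kubernetes defaults for probe fields the manifest leaves unset.
PROBE_DEFAULTS = {
    "initialDelaySeconds": 0,
    "periodSeconds": 10,
    "timeoutSeconds": 1,
    "successThreshold": 1,
    "failureThreshold": 3,
}

# Probe settings copied from the deployment above.
READINESS = {"initialDelaySeconds": 10, "timeoutSeconds": 5}
LIVENESS = {"initialDelaySeconds": 30, "periodSeconds": 5}
TERMINATION_GRACE_PERIOD_SECONDS = 300


def effective_probe(probe):
    """Return the probe with unset fields filled in from the defaults."""
    return {**PROBE_DEFAULTS, **probe}


def failure_window_seconds(probe):
    """Approximate time for consecutive failures to reach failureThreshold."""
    p = effective_probe(probe)
    return p["failureThreshold"] * p["periodSeconds"]


liveness_window = failure_window_seconds(LIVENESS)
readiness_window = failure_window_seconds(READINESS)

# Rough upper bound for a graceful kill once the liveness probe has failed:
# the failure window plus the full termination grace period.
expected_upper_bound = liveness_window + TERMINATION_GRACE_PERIOD_SECONDS

print(liveness_window, readiness_window, expected_upper_bound)
```

Under these assumptions the container should be gone within roughly five minutes of the liveness probe starting to fail, which is what makes a pod stuck in Terminating for many hours stand out.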
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (`docker version`):
Client: Docker Engine - Community
Version: 19.03.14
API version: 1.40
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:20:42 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.14
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:19:17 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.18.0
GitCommit: fec3683
- containerd version (`containerd --version`):
containerd containerd.io 1.4.3 269548fa27e0089a8b8278fc4fc781d7f65a939b
- Cloud provider or hardware configuration:
- AWS / m5d.8xlarge
- OS (e.g: `cat /etc/os-release`): CentOS Linux release 7.9.2009 (Core)
- Kernel (e.g. `uname -a`): Linux SERVERNAME 4.4.245-1.el7.elrepo.x86_64 #1 SMP Fri Nov 20 09:39:52 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
Here are the logs
kubelet.log snippet as it was trying to delete the container
`E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 00:10:16.234249 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
STUCK FOR 16 HOURS 43 MINUTES
E0328 16:43:16.796493 17532 kubelet.go:1576] error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"`
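The spacing of these errors can be worked out from the klog headers. Below is a minimal sketch that parses the four headers quoted above and computes the gap between the first two retries and the total span covered by the snippet. The year is an assumption (klog omits it; 2021 is taken from the dockerd timestamps further down), and the messages are shortened to their headers.

```python
import re
from datetime import datetime

# klog header: severity letter, mmdd, hh:mm:ss.uuuuuu, thread id, file:line]
KLOG_HEADER = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (?P<src>[\w.]+:\d+)\]"
)

# Headers of the kubelet lines quoted above (message text shortened).
LINES = [
    "E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod",
    "E0328 00:10:16.234249 17532 pod_workers.go:191] Error syncing pod",
    "E0328 16:43:16.796493 17532 kubelet.go:1576] error killing pod",
    "E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod",
]

ASSUMED_YEAR = 2021  # assumption: klog does not record the year


def klog_time(line):
    """Extract the timestamp from a klog-formatted line."""
    m = KLOG_HEADER.match(line)
    if m is None:
        raise ValueError(f"not a klog line: {line!r}")
    stamp = f"{ASSUMED_YEAR}-{m['month']}-{m['day']} {m['time']}"
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


times = [klog_time(line) for line in LINES]
retry_gap = times[1] - times[0]    # gap between the first two KillContainer errors
total_span = times[-1] - times[0]  # first to last error in the snippet

print("retry gap (s):", retry_gap.total_seconds())
print("total span:", total_span)
```

The first two errors come out about 301 seconds apart, close to the terminationGracePeriodSeconds of 300 in the manifest, and the snippet as a whole spans a little over 16 hours 38 minutes. Whether the 300-second spacing is caused by the grace period or merely coincides with it is not something these log lines establish on their own.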
==================
DOCKER DAEMON LOGS FOR THE NEXT DAY, WHEN THE POD WAS LONG GONE (FORCEFULLY DELETED BY US)
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.402804589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403058811Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403283589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403525118Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
=====================
CONTAINERD LOG -- containerd deleted the container when requested, but as seen above, dockerd is still querying its status the next day, and kubelet is stuck because it is waiting on dockerd to report the successful deletion of the pod.
/var/log/messages-20210328:Mar 28 00:00:04 SERVERNAME containerd: time="2021-03-28T00:00:04.927077043Z" level=info msg="shim reaped" id=92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99
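To make the cross-reference between the three logs explicit, here is a small sketch that checks that containerd and dockerd refer to the same container ID and measures the time between containerd's "shim reaped" event and the later entries. It assumes the kubelet's klog timestamps are UTC and from 2021 (klog records neither; the syslog prefix matching the `Z` timestamp suggests a UTC host clock, but that is an inference). The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Payloads copied from the containerd and dockerd log lines above.
CONTAINERD_LINE = (
    'time="2021-03-28T00:00:04.927077043Z" level=info msg="shim reaped" '
    "id=92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99"
)
DOCKERD_LINE = (
    'time="2021-03-29T00:00:03.402804589Z" level=error msg="Handler for GET '
    "/containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99"
    '/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"'
)

# Last kubelet error (E0328 16:43:16.796513); UTC and 2021 are assumptions.
KUBELET_LAST_ERROR = datetime(2021, 3, 28, 16, 43, 16, 796513, tzinfo=timezone.utc)


def parse_rfc3339_nanos(text):
    """Parse a UTC RFC 3339 timestamp, truncating nanoseconds to microseconds."""
    m = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z", text)
    if m is None:
        raise ValueError(f"unexpected timestamp: {text!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int(m.group(2)[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def log_time(line):
    """Return the time="..." field of a containerd/dockerd log line."""
    return parse_rfc3339_nanos(re.search(r'time="([^"]+)"', line).group(1))


def container_id(line):
    """Return the first 64-character hex container ID found in the line."""
    return re.search(r"\b[0-9a-f]{64}\b", line).group(0)


same_container = container_id(CONTAINERD_LINE) == container_id(DOCKERD_LINE)
stuck_for = KUBELET_LAST_ERROR - log_time(CONTAINERD_LINE)
dockerd_lag = log_time(DOCKERD_LINE) - log_time(CONTAINERD_LINE)

print("same container:", same_container)
print("shim reaped -> last kubelet error:", stuck_for)
print("shim reaped -> dockerd errors:", dockerd_lag)
```

Under these assumptions the IDs match, the last kubelet error falls about 16 hours 43 minutes after the shim was reaped (in line with the "STUCK FOR 16 HOURS 43 MINUTES" note), and the dockerd errors are logged almost exactly 24 hours after it.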