Pods stuck on terminating #100695

@Hitesh-Agrawal

Description

What happened:

Pods are stuck in the Terminating state.

What you expected to happen:

Pods to be terminated after failing the readiness and liveness probes.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Deployment with terminationGracePeriodSeconds: 300.
  2. kubelet fails to delete and recreate the pod, even after the 300-second termination grace period has elapsed.
  3. We had to forcefully delete the pod after it was stuck in the Terminating state.
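
For readers hitting the same symptom, the forced deletion in step 3 can be sketched as follows. This is a minimal sketch, not the exact command used in this report: the pod name `abc-pod-name` and the helper `force_delete_command` are hypothetical, and only the `kubectl delete pod` flags `--namespace`, `--grace-period=0`, and `--force` are relied on. A forced delete removes the pod object from the API server without waiting for kubelet to confirm, so the container may still be running on the node.

```python
import shlex
from typing import List


def force_delete_command(pod_name: str, namespace: str = "default") -> List[str]:
    """Build the kubectl argv for a forced pod deletion (hypothetical helper)."""
    return [
        "kubectl", "delete", "pod", pod_name,
        "--namespace", namespace,
        "--grace-period=0",  # do not wait for graceful termination
        "--force",           # delete the API object without kubelet confirmation
    ]


if __name__ == "__main__":
    # "abc-pod-name" is a placeholder, not the real pod name from this report.
    print(shlex.join(force_delete_command("abc-pod-name")))
```

The list can be passed to `subprocess.run` on a machine where `kubectl` is configured for the affected cluster.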

Anything else we need to know?:

Below is the deployment.yaml (I have redacted confidential info):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ABC
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ABC
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        checksum/config: xxxxxxxxxxxxxxx
      labels:
        app: ABC
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ABC
              namespaces: ["default"]
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: abc
        image: abc:latest
        workingDir: /work
        resources:
          requests:
            cpu: 0
            memory: 1Gi
          limits:
            cpu: 2
            memory: 1Gi
        readinessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 6565
          initialDelaySeconds: 30
          periodSeconds: 5
        volumeMounts:
        - name: config
          mountPath: /abc
          subPath: config.conf
          readOnly: true
        - name: config
          mountPath: /abc
          subPath: config.json
          readOnly: true
        - name: archive
          mountPath: /archive
          subPath: /abc
      - name: telegraf
        image: telegraf:latest
        resources:
          requests:
            cpu: 0
            memory: 96Mi
          limits:
            cpu: 1
            memory: 96Mi
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config
        configMap:
          name: ABC
      - name: archive
        nfs:
          server: "archive-server"
          path: /
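
The probes in the manifest leave several fields unset, so it may help to see which values apply once the Kubernetes defaults are filled in (periodSeconds 10, timeoutSeconds 1, successThreshold 1, failureThreshold 3, initialDelaySeconds 0). The sketch below is an illustration only: `effective_probe` and `detection_window_seconds` are hypothetical helpers, and the "detection window" is a rough approximation (failureThreshold × periodSeconds) of how long consecutive failures last before kubelet acts, not an exact kubelet timing.

```python
from typing import Dict

# Documented Kubernetes defaults for probe timing fields.
PROBE_DEFAULTS: Dict[str, int] = {
    "initialDelaySeconds": 0,
    "periodSeconds": 10,
    "timeoutSeconds": 1,
    "successThreshold": 1,
    "failureThreshold": 3,
}


def effective_probe(probe: Dict[str, int]) -> Dict[str, int]:
    """Overlay the timing fields set in the manifest on top of the defaults."""
    merged = dict(PROBE_DEFAULTS)
    merged.update({k: v for k, v in probe.items() if k in PROBE_DEFAULTS})
    return merged


def detection_window_seconds(probe: Dict[str, int]) -> int:
    """Rough time (seconds) of consecutive failures before the probe is failed."""
    p = effective_probe(probe)
    return p["failureThreshold"] * p["periodSeconds"]


# Timing fields copied from the manifest in this report.
LIVENESS = {"initialDelaySeconds": 30, "periodSeconds": 5}
READINESS = {"initialDelaySeconds": 10, "timeoutSeconds": 5}

if __name__ == "__main__":
    print("liveness:", effective_probe(LIVENESS), detection_window_seconds(LIVENESS))
    print("readiness:", effective_probe(READINESS), detection_window_seconds(READINESS))
```

Under these assumptions the liveness probe would be considered failed after roughly 15 seconds of consecutive failures, after which the container kill is attempted with the 300-second grace period.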

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (docker version):
    Client: Docker Engine - Community
     Version: 19.03.14
     API version: 1.40
     Go version: go1.13.15
     Git commit: 5eb3275d40
     Built: Tue Dec 1 19:20:42 2020
     OS/Arch: linux/amd64
     Experimental: false
    Server: Docker Engine - Community
     Engine:
      Version: 19.03.14
      API version: 1.40 (minimum version 1.12)
      Go version: go1.13.15
      Git commit: 5eb3275d40
      Built: Tue Dec 1 19:19:17 2020
      OS/Arch: linux/amd64
      Experimental: false
     containerd:
      Version: 1.4.3
      GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
     runc:
      Version: 1.0.0-rc92
      GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
     docker-init:
      Version: 0.18.0
      GitCommit: fec3683
  • Containerd version (containerd --version):
    containerd containerd.io 1.4.3 269548fa27e0089a8b8278fc4fc781d7f65a939b

  • Cloud provider or hardware configuration:
  • AWS / m5d.8xlarge
  • OS (e.g: cat /etc/os-release):
  • CentOS Linux release 7.9.2009 (Core)
  • Kernel (e.g. uname -a):
  • Linux SERVERNAME 4.4.245-1.el7.elrepo.x86_64 #1 SMP Fri Nov 20 09:39:52 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
    Here are the logs

kubelet.log snippet as it was trying to delete the container
`E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 00:10:16.234249 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

STUCK FOR 16 HOURS 43 MINUTES

E0328 16:43:16.796493 17532 kubelet.go:1576] error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod 75ba2b6a-b39c-4745-b6a6-e2bf4d02afda ("XXXXXXXXXXXXXXXX(75ba2b6a-b39c-4745-b6a6-e2bf4d02afda)"), skipping: error killing pod: failed to "KillContainer" for "ABC" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"`
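
To check how long kubelet kept retrying, the klog timestamps in the snippet above can be compared directly. The following is a minimal sketch under two assumptions: klog lines omit the year, so 2021 is taken from the dockerd log timestamps in this report, and the sample messages are abbreviated copies of the first and last `KillContainer` errors quoted above. `klog_time` and `kill_error_span` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# klog header: severity letter, mmdd, wall-clock time, PID, file:line, then "]".
KLOG_HEADER = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s+\S+\] ")


def klog_time(line: str, year: int) -> Optional[datetime]:
    """Parse the timestamp of one klog line; return None for non-klog lines."""
    match = KLOG_HEADER.match(line.strip().lstrip("`"))
    if match is None:
        return None
    mmdd, clock = match.groups()
    return datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")


def kill_error_span(lines: Iterable[str], year: int) -> Optional[timedelta]:
    """Time between the first and last KillContainer error, if at least two exist."""
    times = []
    for line in lines:
        if "KillContainer" not in line:
            continue
        stamp = klog_time(line, year)
        if stamp is not None:
            times.append(stamp)
    if len(times) < 2:
        return None
    return max(times) - min(times)


# Abbreviated copies of the first and last errors from the snippet above.
SAMPLE = [
    'E0328 00:05:15.236267 17532 pod_workers.go:191] Error syncing pod: failed to "KillContainer" for "ABC"',
    "STUCK FOR 16 HOURS 43 MINUTES",
    'E0328 16:43:16.796513 17532 pod_workers.go:191] Error syncing pod: failed to "KillContainer" for "ABC"',
]

if __name__ == "__main__":
    print(kill_error_span(SAMPLE, 2021))
```

On the two quoted timestamps this gives a span of about 16 hours 38 minutes between the first and last logged error, in line with the stuck period reported above.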

==================
DOCKER DAEMON LOGS FROM THE NEXT DAY, WHEN THE POD WAS LONG GONE (FORCEFULLY DELETED BY US)

Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.402804589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403058811Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403283589Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Mar 29 00:00:03 SERVERNAME dockerd[15521]: time="2021-03-29T00:00:03.403525118Z" level=error msg="Handler for GET /containers/92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

=====================
CONTAINERD LOG: containerd deleted the container when requested, but as seen above, dockerd was still querying its status the next day, and kubelet was stuck because it was waiting on dockerd to report the successful deletion of the pod.

/var/log/messages-20210328:Mar 28 00:00:04 SERVERNAME containerd: time="2021-03-28T00:00:04.927077043Z" level=info msg="shim reaped" id=92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99
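
The claim that dockerd kept querying a container that containerd had already reaped can be checked by matching the container ID across the two logs. This is a minimal sketch: the two sample lines are copied from this report, while `parse_entry` and `query_after_reap` are hypothetical helpers. Timestamps are truncated to microseconds because `datetime.strptime` does not accept nanosecond precision.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

TIME_RE = re.compile(r'time="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z"')
ID_RE = re.compile(r"\b[0-9a-f]{64}\b")  # full-length container ID


def parse_entry(line: str) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, container ID) for a dockerd/containerd log line."""
    stamp = TIME_RE.search(line)
    cid = ID_RE.search(line)
    if stamp is None or cid is None:
        return None
    # Keep at most 6 fractional digits so that %f can parse them.
    text = f"{stamp.group(1)}.{stamp.group(2)[:6]}"
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f"), cid.group(0)


def query_after_reap(containerd_line: str, dockerd_line: str) -> Optional[timedelta]:
    """Delay between the shim reap and a dockerd query for the same container."""
    reaped = parse_entry(containerd_line)
    queried = parse_entry(dockerd_line)
    if reaped is None or queried is None or reaped[1] != queried[1]:
        return None
    return queried[0] - reaped[0]


CONTAINERD_LINE = (
    '/var/log/messages-20210328:Mar 28 00:00:04 SERVERNAME containerd: '
    'time="2021-03-28T00:00:04.927077043Z" level=info msg="shim reaped" '
    'id=92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99'
)
DOCKERD_LINE = (
    'Mar 29 00:00:03 SERVERNAME dockerd[15521]: '
    'time="2021-03-29T00:00:03.402804589Z" level=error '
    'msg="Handler for GET /containers/'
    '92895962c9d859b3dee39914660f9d74925d2f373251db7e19005c4445e8fa99'
    '/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"'
)

if __name__ == "__main__":
    print(query_after_reap(CONTAINERD_LINE, DOCKERD_LINE))
```

For the two lines quoted in this report, both carry the same container ID and the dockerd query comes just under 24 hours after the shim was reaped.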

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
