Description
If a pod uses storage such as NFS and the node can no longer read or unmount the mounted directory, the kubelet can become completely wedged: it cannot start any new pod that uses volumes (which is basically all of them, since pods mount secret tokens) until either the storage issue is resolved or the kubelet is restarted.
To reproduce:
- Create a pod that uses an NFS volume
- Stop the NFS server
- Try to delete the pod
- Wait a couple of minutes - you'll see the pod stuck Terminating
- Try to create a new pod (kubectl run --rm --attach --restart Never --image busybox bbox date)
The busybox pod will be stuck ContainerCreating with events such as these:
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
8m 8m 1 {default-scheduler } Normal Scheduled Successfully assigned bbox to 127.0.0.1
6m 6s 4 {kubelet 127.0.0.1} Warning FailedMount Unable to mount volumes for pod "bbox_default(12a20cdb-694f-11e6-baa4-001c42e13e5d)": timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]
6m 6s 4 {kubelet 127.0.0.1} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]
A stack trace I gathered after deleting the pod shows that the volume reconciler is still trying to get the volumes for the pod I just deleted. It also shows a goroutine that is stuck trying to stop the Docker container.
A second stack trace, gathered after I tried to create the bbox pod, shows that the new pod (bbox) is waiting for its volumes to attach/mount (in this case, secrets).
We've seen this in 1.2.x and I just reproduced it in master (commit f297ea9).
cc @kubernetes/sig-storage @kubernetes/sig-node @kubernetes/rh-cluster-infra @pmorie @derekwaynecarr @timothysc @saad-ali