
Hung volumes can wedge the kubelet #31272

@ncdc

Description

If a pod uses storage such as NFS and the system becomes unable to read or unmount the mounted directory, it is possible to completely wedge the kubelet so that it can't successfully run any new pods that use volumes (which is basically all pods, since they mount secret tokens) until either the storage issue is resolved or you restart the kubelet.

To reproduce:

  1. Create a pod that uses an NFS volume
  2. Stop the NFS server
  3. Try to delete the pod
  4. Wait a couple of minutes - you'll see the pod stuck Terminating
  5. Try to create a new pod (kubectl run --rm --attach --restart Never --image busybox bbox date)

The busybox pod will be stuck ContainerCreating with events such as these:

Events:
  FirstSeen     LastSeen        Count   From                    SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                    -------------   --------        ------          -------
  8m            8m              1       {default-scheduler }                    Normal          Scheduled       Successfully assigned bbox to 127.0.0.1
  6m            6s              4       {kubelet 127.0.0.1}                     Warning         FailedMount     Unable to mount volumes for pod "bbox_default(12a20cdb-694f-11e6-baa4-001c42e13e5d)": timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]
  6m            6s              4       {kubelet 127.0.0.1}                     Warning         FailedSync      Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "bbox"/"default". list of unattached/unmounted volumes=[default-token-joiwi]

In a stack trace I gathered after deleting the pod, the volume reconciler is still trying to get the volumes for the pod I just deleted. You'll also see a goroutine trying to stop the Docker container, but it is stuck.

In a stack trace I gathered after trying to create the bbox pod, the new pod (bbox) is waiting for its volumes to attach/mount (in this case, secrets).

We've seen this in 1.2.x and I just reproduced it in master (commit f297ea9).

cc @kubernetes/sig-storage @kubernetes/sig-node @kubernetes/rh-cluster-infra @pmorie @derekwaynecarr @timothysc @saad-ali

Metadata

Labels

  area/kubelet
  kind/bug: Categorizes issue or PR as related to a bug.
  lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
