container runtime is down after kubelet is restarted #3440

@sofat1989

Description

From the kubelet goroutine dump, we found that a goroutine is stuck in initializeRuntimeDependentModules:

initializeRuntimeDependentModules  --->
kl.cadvisor.Start() --->
err = self.createContainer("/", watcher.Raw) --->
createContainerLocked  --->
factory.NewContainerHandler --->
newContainerdContainerHandler --->
TaskPid
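
For context, the tail of that chain is a cadvisor wrapper around containerd's task service. A minimal sketch of why it can block forever (the wrapper type and fields here are illustrative, not cadvisor's exact code): the Get request inherits a context with no deadline, so if the shim never answers, TaskPid never returns.

import (
	"context"

	tasks "github.com/containerd/containerd/api/services/tasks/v1"
)

type client struct {
	taskService tasks.TasksClient
}

// TaskPid asks containerd's task service for the container's init
// process pid. With a deadline-free context, this call blocks for as
// long as the backing containerd-shim refuses to answer.
func (c *client) TaskPid(ctx context.Context, id string) (uint32, error) {
	resp, err := c.taskService.Get(ctx, &tasks.GetRequest{ContainerID: id})
	if err != nil {
		return 0, err
	}
	return resp.Process.Pid, nil
}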

So we suspected that the task service in containerd had gone wrong, and checked it using

ctr -a /var/containerd/containerd.sock -n k8s.io task list

This command hangs and never gets a response.
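
As an aside, the same check can be made to fail fast rather than hang; a minimal sketch using containerd's Go client with an explicit deadline (socket path and namespace taken from the command above, the timeout value is arbitrary):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/containerd/containerd"
	tasks "github.com/containerd/containerd/api/services/tasks/v1"
)

func main() {
	client, err := containerd.New("/var/containerd/containerd.sock",
		containerd.WithDefaultNamespace("k8s.io"))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Bound the call: if any shim is wedged, List fails with
	// DeadlineExceeded after 10s instead of hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.TaskService().List(ctx, &tasks.ListTasksRequest{})
	if err != nil {
		fmt.Println("task list failed:", err)
		return
	}
	fmt.Println("tasks:", len(resp.Tasks))
}
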
containerd's goroutine is stuck in:

func processFromContainerd(ctx context.Context, p runtime.Process) (*task.Process, error) {
	state, err := p.State(ctx)
	if err != nil {
		return nil, err
	}
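
Because p.State is called with the request context, which has no deadline, the List handler blocks for as long as the shim stays silent. A hedged sketch of one possible guard (this is not containerd 1.2.4's actual code): bound the call so a dead shim surfaces as an error instead of a hang.

import (
	"context"
	"time"

	"github.com/containerd/containerd/runtime"
)

// stateWithTimeout is a hypothetical wrapper: it bounds the shim's
// State query so one unresponsive containerd-shim cannot wedge the
// whole task list RPC.
func stateWithTimeout(ctx context.Context, p runtime.Process) (runtime.State, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return p.State(ctx)
}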

The error log:

Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.945987734-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946070031-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=e2ce13abfa4bfd5f88f73846a9e5f56c49c6d79c82965ed84904a386a3dbcf3c
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946089441-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=52713c9e6701d52efcdc5604c2ec327e529d551f51e10bc1935ebb24a0fa1edd
...

The containerd-shim for 904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb cannot reply to the State request:

-bash-4.2# crictl ps -a | grep 04c
904ccd129b84c       d3d0838b080e6       49 years ago        Unknown             ner                             0                   049c8ae78b8b6

The pod that owned the "ner" container is no longer on this node.

Sending SIGUSR1 (kill -SIGUSR1) to the containerd-shim does not produce a goroutine dump.

After killing this containerd-shim, containerd and the kubelet recover.

containerd: 1.2.4
shim: v1

Question:
Because the context has no timeout, the RPC call can hang forever.
If one containerd-shim is unresponsive, restarting the kubelet leads to NodeNotReady.
Is this reasonable? Can we handle such an unresponsive containerd-shim in a better way?
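
For illustration, one direction a kubelet-side mitigation could take (a hypothetical sketch, not what kubelet does today): bound the runtime-dependent initialization so a single dead shim cannot keep the node NotReady indefinitely. The stuck goroutine would still leak, but readiness would not hinge on it.

import (
	"time"

	"k8s.io/klog"
	"k8s.io/kubernetes/pkg/kubelet/cadvisor"
)

// startCAdvisorWithTimeout is a hypothetical helper: it runs
// cadvisor's Start in a goroutine and stops waiting after d, so a
// wedged container runtime cannot block kubelet initialization.
func startCAdvisorWithTimeout(ca cadvisor.Interface, d time.Duration) {
	errCh := make(chan error, 1)
	go func() { errCh <- ca.Start() }()
	select {
	case err := <-errCh:
		if err != nil {
			klog.Errorf("Failed to start cAdvisor: %v", err)
		}
	case <-time.After(d):
		// The Start goroutine leaks, but initialization proceeds
		// and the node does not stay NotReady forever.
		klog.Errorf("cAdvisor start timed out waiting on the container runtime")
	}
}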
