From the kubelet goroutine dump, we found that the goroutine is stuck in initializeRuntimeDependentModules:
initializeRuntimeDependentModules --->
kl.cadvisor.Start() --->
err = self.createContainer("/", watcher.Raw) --->
createContainerLocked --->
factory.NewContainerHandler --->
newContainerdContainerHandler --->
TaskPid
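To illustrate the failure mode (a hypothetical sketch with made-up names, not the actual kubelet/cadvisor source): TaskPid ends up as a synchronous RPC to the shim made with a context that has no deadline, so one unresponsive shim parks the goroutine and initializeRuntimeDependentModules never returns.

package main

import (
	"context"
	"fmt"
	"time"
)

// taskPid stands in for the shim RPC behind TaskPid: it returns only when
// the shim answers or the caller's context is cancelled.
func taskPid(ctx context.Context) (uint32, error) {
	select {
	case <-ctx.Done():
		return 0, ctx.Err()
	case <-time.After(24 * time.Hour): // an unresponsive shim effectively never answers
		return 1234, nil
	}
}

func main() {
	done := make(chan struct{})

	// cadvisor.Start() -> ... -> TaskPid is a plain synchronous call made with
	// a context that has no deadline, so it just sits here.
	go func() {
		defer close(done)
		taskPid(context.Background())
	}()

	select {
	case <-done:
		fmt.Println("TaskPid returned")
	case <-time.After(3 * time.Second):
		// This mirrors the kubelet goroutine dump: the startup goroutine is
		// still parked inside the RPC, so initializeRuntimeDependentModules
		// never completes and the node never becomes Ready.
		fmt.Println("still blocked in TaskPid; kubelet startup is wedged")
	}
}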
So we suspected that containerd's task service had gone wrong, and checked it with:
ctr -a /var/containerd/containerd.sock -n k8s.io task list
This command hangs and never gets a response.
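For what it's worth, the same check can be made from Go with a client-side deadline so it fails fast instead of hanging. This is only a sketch; it assumes the containerd Go client (github.com/containerd/containerd) as we understand it, and reuses the socket path and namespace from the ctr command above.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/containerd/containerd"
	tasks "github.com/containerd/containerd/api/services/tasks/v1"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/var/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Same namespace as the ctr command; bound the call so an unresponsive
	// task service returns "context deadline exceeded" instead of hanging.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	resp, err := client.TaskService().List(ctx, &tasks.ListTasksRequest{})
	if err != nil {
		log.Fatalf("task list failed: %v", err) // deadline exceeded on a wedged node
	}
	fmt.Printf("%d tasks\n", len(resp.Tasks))
}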
containerd's goroutine is stuck in:
func processFromContainerd(ctx context.Context, p runtime.Process) (*task.Process, error) {
    state, err := p.State(ctx)
    if err != nil {
        return nil, err
    }
The error log:
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.945987734-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946070031-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=e2ce13abfa4bfd5f88f73846a9e5f56c49c6d79c82965ed84904a386a3dbcf3c
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946089441-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=52713c9e6701d52efcdc5604c2ec327e529d551f51e10bc1935ebb24a0fa1edd
The containerd-shim for 904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb cannot reply to the State request:
-bash-4.2# crictl ps -a | grep 04c
904ccd129b84c d3d0838b080e6 49 years ago Unknown ner 0 049c8ae78b8b6
The pod that owns the ner container is no longer on this node.
Sending kill -SIGUSR1 to the containerd-shim does not dump its goroutines.
After killing this containerd-shim, containerd and the kubelet recover.
containerd: 1.2.4
shim: v1
Question:
Because the context has no timeout, the RPC call can hang forever.
If one containerd-shim is unresponsive, restarting the kubelet leads to NodeNotReady.
Is that reasonable? Is there a better way to handle such an unresponsive containerd-shim?
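For illustration only (hypothetical stand-in types, not containerd's actual runtime.Process), one possible direction is to bound each per-process State call with its own deadline, so an unresponsive shim makes that one task report Unknown instead of wedging the whole task list, cadvisor, and kubelet startup.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stateFunc stands in for a per-process State RPC to the containerd-shim.
type stateFunc func(ctx context.Context) (string, error)

// stateWithTimeout bounds a single State call so one unresponsive shim
// cannot block the caller forever.
func stateWithTimeout(ctx context.Context, f stateFunc, d time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	return f(ctx)
}

func main() {
	// A stand-in for a wedged shim: it never answers, it only notices cancellation.
	wedged := func(ctx context.Context) (string, error) {
		<-ctx.Done()
		return "", ctx.Err()
	}

	state, err := stateWithTimeout(context.Background(), wedged, 2*time.Second)
	if errors.Is(err, context.DeadlineExceeded) {
		// Degrade gracefully: report Unknown for this one task instead of
		// hanging the task list RPC and everything above it.
		state = "UNKNOWN"
	}
	fmt.Println(state, err)
}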