container runtime is down after kubelet is restarted #3440

@sofat1989

Description

From the kubelet goroutine dump, we found that a goroutine is stuck in initializeRuntimeDependentModules:

initializeRuntimeDependentModules  --->
kl.cadvisor.Start() --->
err = self.createContainer("/", watcher.Raw) --->
createContainerLocked  --->
factory.NewContainerHandler --->
newContainerdContainerHandler --->
TaskPid
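
For context, the tail of that chain is a cadvisor wrapper around containerd's task service. A minimal sketch of why it can block forever (the wrapper type and fields here are illustrative, not cadvisor's exact code): the Get request inherits a context with no deadline, so if the shim never answers, TaskPid never returns.

import (
	"context"

	tasks "github.com/containerd/containerd/api/services/tasks/v1"
)

type client struct {
	taskService tasks.TasksClient
}

// TaskPid asks containerd's task service for the container's init
// process pid. With a deadline-free context, this call blocks for as
// long as the backing containerd-shim refuses to answer.
func (c *client) TaskPid(ctx context.Context, id string) (uint32, error) {
	resp, err := c.taskService.Get(ctx, &tasks.GetRequest{ContainerID: id})
	if err != nil {
		return 0, err
	}
	return resp.Process.Pid, nil
}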

So we suspected that the task service in containerd had gone wrong, and checked it using

ctr -a /var/containerd/containerd.sock -n k8s.io task list

This command hangs and never gets a response.
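
As an aside, the same check can be made to fail fast rather than hang; a minimal sketch using containerd's Go client with an explicit deadline (socket path and namespace taken from the command above, the timeout value is arbitrary):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/containerd/containerd"
	tasks "github.com/containerd/containerd/api/services/tasks/v1"
)

func main() {
	client, err := containerd.New("/var/containerd/containerd.sock",
		containerd.WithDefaultNamespace("k8s.io"))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Bound the call: if any shim is wedged, List fails with
	// DeadlineExceeded after 10s instead of hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.TaskService().List(ctx, &tasks.ListTasksRequest{})
	if err != nil {
		fmt.Println("task list failed:", err)
		return
	}
	fmt.Println("tasks:", len(resp.Tasks))
}
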
containerd's goroutine is stuck in:

func processFromContainerd(ctx context.Context, p runtime.Process) (*task.Process, error) {
	state, err := p.State(ctx)
	if err != nil {
		return nil, err
	}
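
Because p.State is called with the request context, which has no deadline, the List handler blocks for as long as the shim stays silent. A hedged sketch of one possible guard (this is not containerd 1.2.4's actual code): bound the call so a dead shim surfaces as an error instead of a hang.

import (
	"context"
	"time"

	"github.com/containerd/containerd/runtime"
)

// stateWithTimeout is a hypothetical wrapper: it bounds the shim's
// State query so one unresponsive containerd-shim cannot wedge the
// whole task list RPC.
func stateWithTimeout(ctx context.Context, p runtime.Process) (runtime.State, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return p.State(ctx)
}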

The error log:

Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.945987734-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946070031-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=e2ce13abfa4bfd5f88f73846a9e5f56c49c6d79c82965ed84904a386a3dbcf3c
Jul 23 01:47:24 tess-node-lfb3k-1287902.stratus.slc.ebay.com containerd[9066]: time="2019-07-23T01:47:24.946089441-07:00" level=error msg="converting task to protobuf" error="context canceled: unknown" id=52713c9e6701d52efcdc5604c2ec327e529d551f51e10bc1935ebb24a0fa1edd
...

The containerd-shim for 904ccd129b84c8849d40f0b7f3e80a86098eb53c8705ab0f200f26b5bb4df1fb cannot reply to the State request:

-bash-4.2# crictl ps -a | grep 04c
904ccd129b84c       d3d0838b080e6       49 years ago        Unknown             ner                             0                   049c8ae78b8b6

The pod that owned the "ner" container is no longer on this node.

Sending SIGUSR1 (kill -SIGUSR1) to the containerd-shim does not produce a goroutine dump.

After killing this containerd-shim, containerd and the kubelet recover.

containerd: 1.2.4
shim: v1

Question:
Because the context has no timeout, the RPC call can hang forever.
If one containerd-shim is unresponsive, restarting the kubelet leads to NodeNotReady.
Is this reasonable? Can we handle such an unresponsive containerd-shim in a better way?
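
For illustration, one direction a kubelet-side mitigation could take (a hypothetical sketch, not what kubelet does today): bound the runtime-dependent initialization so a single dead shim cannot keep the node NotReady indefinitely. The stuck goroutine would still leak, but readiness would not hinge on it.

import (
	"time"

	"k8s.io/klog"
	"k8s.io/kubernetes/pkg/kubelet/cadvisor"
)

// startCAdvisorWithTimeout is a hypothetical helper: it runs
// cadvisor's Start in a goroutine and stops waiting after d, so a
// wedged container runtime cannot block kubelet initialization.
func startCAdvisorWithTimeout(ca cadvisor.Interface, d time.Duration) {
	errCh := make(chan error, 1)
	go func() { errCh <- ca.Start() }()
	select {
	case err := <-errCh:
		if err != nil {
			klog.Errorf("Failed to start cAdvisor: %v", err)
		}
	case <-time.After(d):
		// The Start goroutine leaks, but initialization proceeds
		// and the node does not stay NotReady forever.
		klog.Errorf("cAdvisor start timed out waiting on the container runtime")
	}
}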
