Containerd should set a default timeout to containerd-shim operations. #2578

@Random-Liu

Description

In both shim v1 (https://github.com/containerd/containerd/blob/master/runtime/v1/shim/v1/shim.proto) and v2 (https://github.com/containerd/containerd/blob/master/runtime/v2/task/shim.proto), the only long-running method is Wait. It makes sense to set a default timeout for all short-running methods.

For Wait, we can leave the timeout to the caller; it is natural for a Wait caller to set one.

This is important because today if a containerd-shim hangs:

  1. Batch container operations hang, e.g. ctr task ls, CRI plugin restart recovery, etc.
  2. Any operation on the hung containerd-shim leaks a goroutine in the containerd daemon.

I think leaking goroutines in containerd-shim is less serious, because it only affects one container. However, leaking goroutines in containerd itself seems pretty bad to me. Consider users running periodic exec or state calls: tons of goroutines will be leaked.

And we do need to consider the case that containerd-shim may become unresponsive, because:

  1. Even the default containerd-shim sometimes hangs, e.g. Inconsistent state on pod termination #2438, containerd hangs #1882.
  2. containerd-shim implementations may become plugins in the future, and their behavior will be even less predictable. We may not want people to come to the containerd repo and file issues about containerd hanging when it turns out that the containerd-shim they are using is what hangs.

@containerd/containerd-maintainers

There is one blocker: containerd/ttrpc#3.
