runtime: runc v2 shim closes platform prematurely and multiple times #3895

This was found and debugged in firecracker-containerd; [the issue there has more details and logs](https://github.com/firecracker-microvm/firecracker-containerd/issues/363). We are currently using the 1.3 branch, but the bug appears to be present in master and in older versions that support the runc v2 multishim too.

The runc v2 shim seems to have an issue that allows `platform.Close()` to be called while the shim is still servicing containers, and to be called more than once. It happens when:
1. The shim starts and a single task is created and then deleted, which executes [these lines that call `platform.Close()`](https://github.com/containerd/containerd/blob/ff48f57fc83a8c44cf4ad5d672424a98ba37ded6/runtime/v2/runc/v2/service.go#L351-L366).
1. Another task is created before the shim receives a shutdown call, so [`Shutdown` returns early and the shim continues to service containers](https://github.com/containerd/containerd/blob/ff48f57fc83a8c44cf4ad5d672424a98ba37ded6/runtime/v2/runc/v2/service.go#L595-L600).
1. Any further tasks will no longer be able to use the platform functionality as it has been closed already.
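
To make the sequence above concrete, here is a minimal sketch that models it. It does not use containerd's code: `fakePlatform`, `modelShim`, and their methods are hypothetical stand-ins that only mimic the bookkeeping described in the steps (close the platform when the last container is deleted, return early from shutdown while containers remain).

```go
package main

import "fmt"

// fakePlatform is a hypothetical stand-in for the shim's platform.
// It only counts how many times Close is called.
type fakePlatform struct{ closeCalls int }

func (p *fakePlatform) Close() error {
	p.closeCalls++
	return nil
}

// modelShim is a simplified, hypothetical model of the shim's bookkeeping.
type modelShim struct {
	platform   *fakePlatform
	containers map[string]struct{}
	exited     bool
}

func newModelShim() *modelShim {
	return &modelShim{
		platform:   &fakePlatform{},
		containers: map[string]struct{}{},
	}
}

func (s *modelShim) Create(id string) { s.containers[id] = struct{}{} }

// Delete models the problematic behavior: the platform is closed as soon
// as the container count drops to zero, even if the shim keeps running.
func (s *modelShim) Delete(id string) {
	delete(s.containers, id)
	if len(s.containers) == 0 {
		s.platform.Close()
	}
}

// Shutdown returns early while any container is still present.
func (s *modelShim) Shutdown() {
	if len(s.containers) > 0 {
		return
	}
	s.exited = true
}

func main() {
	s := newModelShim()
	s.Create("a")
	s.Delete("a") // step 1: platform closed for the first time
	s.Create("b") // step 2: new task arrives before the shutdown call
	s.Shutdown()  // returns early, shim keeps running
	fmt.Println("exited:", s.exited, "closeCalls:", s.platform.closeCalls)
	s.Delete("b") // platform closed a second time
	fmt.Println("closeCalls:", s.platform.closeCalls)
}
```

In this model, container `b` runs against a platform that has already been closed, and deleting it closes the platform a second time.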

This also creates problems when further containers are deleted because, due to the implementation of the platform console code on Linux, calling `platform.Close()` multiple times results in [the same already-closed FD value being closed again](https://github.com/containerd/containerd/blob/9b5581cc9c5bb4e9f4e47606ba883234167b1c23/vendor/github.com/containerd/console/console_linux.go#L154). If that FD value was reused after the first close, `platform.Close()` ends up closing an unrelated FD in the process. This is what firecracker-containerd was occasionally observing as `EBADF` errors from an unrelated call to `os.RemoveAll` elsewhere in the process. The fact that closing the console multiple times allows this is probably a separate bug that should also be addressed; I will open a separate issue in the console repo for it.
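
One common way to make a repeated close harmless is to guard it with `sync.Once`. The sketch below is only an illustration of that general technique, not the console package's code or a proposed patch: `onceCloser` and `countingCloser` are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// onceCloser is a hypothetical wrapper that forwards only the first Close
// call to the underlying closer and returns the same result afterwards.
type onceCloser struct {
	once sync.Once
	c    io.Closer
	err  error
}

func (o *onceCloser) Close() error {
	o.once.Do(func() { o.err = o.c.Close() })
	return o.err
}

// countingCloser is a hypothetical stand-in for a resource that owns an FD.
type countingCloser struct{ closeCalls int }

func (c *countingCloser) Close() error {
	c.closeCalls++
	return nil
}

func main() {
	inner := &countingCloser{}
	guarded := &onceCloser{c: inner}
	guarded.Close()
	guarded.Close() // no effect on the underlying resource
	fmt.Println("underlying closes:", inner.closeCalls)
}
```

With a guard like this, a second `Close()` cannot reach the underlying FD, so a reused FD number would not be closed by mistake.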

This problem was especially noticeable in firecracker-containerd because we wrap the runc v2 shim and support a flag that tells the shim to keep running even when there are no containers left (waiting instead for a separate custom `StopVM` call).

I have a fix that calls `platform.Close()` only when the shim's context is canceled in `Shutdown`. I am creating this issue so the fix can reference it.
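
The idea behind that fix can be sketched as follows. This is a simplified illustration under assumptions, not the actual patch: `fixedShim`, `countingPlatform`, and their methods are hypothetical. The only point is that deleting a container no longer touches the platform, and the close happens together with the context cancellation in shutdown.

```go
package main

import (
	"context"
	"fmt"
)

// countingPlatform is a hypothetical stand-in that counts Close calls.
type countingPlatform struct{ closeCalls int }

func (p *countingPlatform) Close() error {
	p.closeCalls++
	return nil
}

// fixedShim is a hypothetical model of a shim that ties the platform's
// lifetime to its own context instead of to the container count.
type fixedShim struct {
	ctx        context.Context
	cancel     context.CancelFunc
	platform   *countingPlatform
	containers map[string]struct{}
}

func newFixedShim() *fixedShim {
	ctx, cancel := context.WithCancel(context.Background())
	return &fixedShim{
		ctx:        ctx,
		cancel:     cancel,
		platform:   &countingPlatform{},
		containers: map[string]struct{}{},
	}
}

func (s *fixedShim) Create(id string) { s.containers[id] = struct{}{} }

// Delete only removes the container; it leaves the platform open.
func (s *fixedShim) Delete(id string) { delete(s.containers, id) }

// Shutdown closes the platform only at the point where the context is
// canceled, and does nothing if that has already happened.
func (s *fixedShim) Shutdown() error {
	if len(s.containers) > 0 {
		return nil // still servicing containers
	}
	if s.ctx.Err() != nil {
		return nil // already shut down
	}
	s.cancel()
	return s.platform.Close()
}

func main() {
	s := newFixedShim()
	s.Create("a")
	s.Delete("a")
	s.Create("b")
	s.Shutdown() // returns early, platform stays open for "b"
	fmt.Println("closeCalls while b runs:", s.platform.closeCalls)
	s.Delete("b")
	s.Shutdown()
	s.Shutdown() // repeated shutdown does not close again
	fmt.Println("closeCalls after shutdown:", s.platform.closeCalls)
}
```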
