This was found and debugged in firecracker-containerd; the issue there has more details and logs. We are currently using the 1.3 branch, but the problem appears to be present in master and in older versions that support the runc v2 multishim too.
The runc v2 shim seems to have an issue that allows `platform.Close()` to be called while the shim is still servicing containers, and to be called multiple times. It happens when:
- The shim starts and a single task is created and then deleted, which causes these lines that call `platform.Close()` to be executed.
- Another task is created before the shim receives a shutdown call, meaning `Shutdown` returns early and the shim continues to service containers.
- Any further tasks will no longer be able to use the platform functionality as it has been closed already.
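
The sequence above can be modeled with a small, self-contained sketch. Everything here (`fakeShim`, `fakePlatform`, `runScenario`) is hypothetical and only imitates the behavior described in this issue, on the assumption that the platform is closed whenever the last task is deleted; it is not the shim's actual code.

```go
package main

import "fmt"

// fakePlatform is a hypothetical stand-in for the shim's console platform.
// It only counts how many times Close is called.
type fakePlatform struct{ closeCalls int }

func (p *fakePlatform) Close() { p.closeCalls++ }

// fakeShim is a heavily simplified, hypothetical model of the shim's
// task bookkeeping. It is not the real runc v2 shim.
type fakeShim struct {
	platform *fakePlatform
	tasks    map[string]bool
	exited   bool
}

func newFakeShim() *fakeShim {
	return &fakeShim{platform: &fakePlatform{}, tasks: map[string]bool{}}
}

func (s *fakeShim) Create(id string) { s.tasks[id] = true }

// Delete models the problematic behavior (an assumption based on the
// description above): the platform is closed as soon as no tasks remain.
func (s *fakeShim) Delete(id string) {
	delete(s.tasks, id)
	if len(s.tasks) == 0 {
		s.platform.Close()
	}
}

// Shutdown returns early while tasks are still present.
func (s *fakeShim) Shutdown() {
	if len(s.tasks) > 0 {
		return
	}
	s.exited = true
}

// runScenario replays the steps from the list above.
func runScenario() (closeCalls int, exited bool) {
	s := newFakeShim()
	s.Create("a")
	s.Delete("a") // first Close: platform is now unusable
	s.Create("b") // arrives before the shutdown call
	s.Shutdown()  // returns early, shim keeps running
	s.Delete("b") // second Close on an already-closed platform
	return s.platform.closeCalls, s.exited
}

func main() {
	calls, exited := runScenario()
	fmt.Printf("platform.Close calls: %d, shim exited: %v\n", calls, exited)
}
```

In this model the platform ends up closed twice while the shim is still alive, which matches the two symptoms reported here.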
This also creates problems when further containers are deleted. Due to the implementation of the platform console code on Linux, calling `platform.Close()` multiple times results in the same already-closed FD value being closed again. If that FD value was reused after the first close, `platform.Close()` will close an unrelated FD in the process. This is what firecracker-containerd was occasionally observing as `EBADF` errors from an unrelated call to `os.RemoveAll` happening elsewhere in the process. The fact that closing the console multiple times allows this is probably a separate bug that should also be addressed; I will open a separate issue in the console repo for it.
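
One possible way to make a repeated close harmless is to guard it with `sync.Once`. The following is a minimal sketch under that assumption, not the console package's code; `onceCloser` and `countingCloser` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// countingCloser is a hypothetical stand-in for a resource that owns an FD.
// It records how many times the underlying Close actually runs.
type countingCloser struct{ closes int }

func (c *countingCloser) Close() error {
	c.closes++
	return nil
}

// onceCloser forwards only the first Close call to the wrapped closer.
// Later calls return the result of the first call without touching it again.
type onceCloser struct {
	once sync.Once
	c    io.Closer
	err  error
}

func (o *onceCloser) Close() error {
	o.once.Do(func() { o.err = o.c.Close() })
	return o.err
}

// closeTwice closes the guarded resource twice and reports how many
// times the underlying Close ran.
func closeTwice() int {
	inner := &countingCloser{}
	guarded := &onceCloser{c: inner}
	_ = guarded.Close()
	_ = guarded.Close()
	return inner.closes
}

func main() {
	fmt.Println("underlying Close calls:", closeTwice())
}
```

With a guard like this, a second close never reaches the stale FD value, so it cannot close an FD that has since been reused.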
This problem was especially noticeable in firecracker-containerd because we wrap around the runc v2 shim and support a flag that tells the shim to run continuously even when there are no containers left (waiting instead for a separate custom `StopVM` call).
I have a fix that calls `platform.Close()` only when the shim's context is canceled in `Shutdown`; I am creating this issue just so the fix can reference it.
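
The idea behind the fix can be sketched roughly as follows. This is a simplified illustration, not the actual patch; `fixedShim`, `fakePlatform`, and `runFixedScenario` are hypothetical names, and the real shim has more state and locking than shown here.

```go
package main

import (
	"context"
	"fmt"
)

// fakePlatform is a hypothetical stand-in that counts Close calls.
type fakePlatform struct{ closeCalls int }

func (p *fakePlatform) Close() { p.closeCalls++ }

// fixedShim is a simplified, hypothetical model of a shim that closes
// its platform only when its context is canceled in Shutdown.
type fixedShim struct {
	ctx      context.Context
	cancel   context.CancelFunc
	platform *fakePlatform
	tasks    map[string]bool
}

func newFixedShim() *fixedShim {
	ctx, cancel := context.WithCancel(context.Background())
	return &fixedShim{ctx: ctx, cancel: cancel, platform: &fakePlatform{}, tasks: map[string]bool{}}
}

func (s *fixedShim) Create(id string) { s.tasks[id] = true }

// Delete no longer touches the platform.
func (s *fixedShim) Delete(id string) { delete(s.tasks, id) }

// Shutdown returns early while tasks remain. Otherwise it cancels the
// context and closes the platform, and it does so at most once.
func (s *fixedShim) Shutdown() {
	if len(s.tasks) > 0 {
		return
	}
	if s.ctx.Err() != nil {
		return // context already canceled by an earlier Shutdown
	}
	s.cancel()
	s.platform.Close()
}

// runFixedScenario replays the problematic sequence against the fixed model.
func runFixedScenario() (closesWhileRunning, closesAfterExit int) {
	s := newFixedShim()
	s.Create("a")
	s.Delete("a")
	s.Create("b") // arrives before the shutdown call
	s.Shutdown()  // returns early, platform stays open
	closesWhileRunning = s.platform.closeCalls
	s.Delete("b")
	s.Shutdown() // cancels the context and closes the platform
	s.Shutdown() // repeated call is a no-op
	return closesWhileRunning, s.platform.closeCalls
}

func main() {
	running, final := runFixedScenario()
	fmt.Printf("Close calls while running: %d, after exit: %d\n", running, final)
}
```

Tying the close to context cancellation means the platform stays usable for as long as the shim is servicing containers, and is closed exactly once when the shim actually exits.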