What is the problem you're trying to solve
In the current design, the containerd runtime v2 task.Delete API ignores Shutdown errors, for instance context.Canceled or context.DeadlineExceeded, and then deletes the shim instance from memory. If the shim never receives the Shutdown call, it won't exit by itself and becomes a leaky shim.
Lines 471 to 481 in ba852fa:

```go
// Don't shutdown sandbox as there may be other containers running.
// Let controller decide when to shutdown.
if !sandboxed {
	if err := s.waitShutdown(ctx); err != nil {
		// FIXME(fuweid):
		//
		// If the error is context canceled, should we use context.TODO()
		// to wait for it?
		log.G(ctx).WithField("id", s.ID()).WithError(err).Error("failed to shutdown shim task and the shim might be leaked")
	}
}
```
I used a Shutdown failpoint to reproduce this issue in #8954; please check out the code:

```go
func TestIssue7496_ShouldRetryShutdown(t *testing.T) {
```
I want to fix the leaky shim issue at the API-design level, instead of adding easy-to-break patches. The key, I think, is to make the task.Delete API retriable.
Related issues:
- shim process leaked when the host having high disk i/o usage #7496
- Containerd-Shim leak when creating/delete pod frequently #8931
Describe the solution you'd like
Before we make the task.Delete API retriable, we should rethink the task events and the runc-shim delete binary call.
Currently, both the task.Exit and task.Delete events can be sent by either the shim or the containerd runtime v2 plugin.
The runtime v2 plugin doesn't know whether the shim has already sent them.
So the runtime v2 plugin sends the events to subscribers (like moby/moby) only if the shim.Delete API call is successful.
Lines 149 to 163 in ba852fa:

```go
func cleanupAfterDeadShim(ctx context.Context, id string, rt *runtime.NSMap[ShimInstance], events *exchange.Exchange, binaryCall *binary) {
	ctx, cancel := timeout.WithContext(ctx, cleanupTimeout)
	defer cancel()

	log.G(ctx).WithField("id", id).Warn("cleaning up after shim disconnected")
	response, err := binaryCall.Delete(ctx)
	if err != nil {
		log.G(ctx).WithError(err).WithField("id", id).Warn("failed to clean up after shim disconnected")
	}

	if _, err := rt.Get(ctx, id); err != nil {
		// Task was never started or was already successfully deleted
		// No need to publish events
		return
	}
```
And the runc-shim doesn't persist the container's exit code in the bundle dir. The runc-shim always uses 137 as the exit code, even if the container's real exit code is zero. For example, #4769.
I think we should define the shim.Delete binary call in more detail in https://github.com/containerd/containerd/blob/main/runtime/v2/README.md#delete: the shim implementation should ensure that the exit code returned by the shim.Delete binary call is aligned with the one returned by the shim.Delete API call.
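For illustration, here is a minimal sketch of how a shim could persist the exit status in the bundle dir so that a later shim.Delete binary call can report the real exit code instead of a hard-coded 137. The file name ("exit-status"), the JSON layout, and both helpers are assumptions for this proposal, not existing runc-shim behavior.

```go
// Sketch only: hypothetical helpers for persisting the exit status in the
// bundle dir. The "exit-status" file name and JSON layout are assumptions.
package shimexit

import (
	"encoding/json"
	"os"
	"path/filepath"
	"time"
)

type persistedExit struct {
	Pid        uint32    `json:"pid"`
	ExitStatus uint32    `json:"exit_status"`
	ExitedAt   time.Time `json:"exited_at"`
}

// writeExitStatus would be called by the shim when the init process exits.
func writeExitStatus(bundle string, e persistedExit) error {
	data, err := json.Marshal(e)
	if err != nil {
		return err
	}
	tmp := filepath.Join(bundle, ".exit-status.tmp")
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	// Rename atomically so a concurrent delete binary call never reads a
	// partially written file.
	return os.Rename(tmp, filepath.Join(bundle, "exit-status"))
}

// readExitStatus would be called from the delete binary path; if the file
// is missing, the caller can fall back to today's behavior (137).
func readExitStatus(bundle string) (persistedExit, error) {
	var e persistedExit
	data, err := os.ReadFile(filepath.Join(bundle, "exit-status"))
	if err != nil {
		return e, err
	}
	return e, json.Unmarshal(data, &e)
}
```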
There is one question we need to confirm:
- Is it allowed to send the same event multiple times?
If so, and the shim.Delete binary call can return the real exit code, the retriable task.Delete would look like:
```go
func (s *shimTask) delete(ctx context.Context, sandboxed bool, removeTask func(ctx context.Context, id string)) (*runtime.Exit, error) {
	response, shimErr := s.task.Delete(ctx, &task.DeleteRequest{ID: s.ID()})
	if shimErr != nil {
		log.G(ctx).WithField("id", s.ID()).WithError(shimErr).Debug("failed to delete task")
		if !errors.Is(shimErr, ttrpc.ErrClosed) {
			shimErr = errdefs.FromGRPC(shimErr)
			if !errdefs.IsNotFound(shimErr) {
				return nil, shimErr
			}
		}
	}

	if err := s.waitShutdown(ctx); err != nil {
		return nil, err
	}
	if err := s.ShimInstance.Delete(ctx); err != nil {
		return nil, err
	}

	// send the exit and delete events
	removeTask(ctx, s.ID())

	return &runtime.Exit{
		Status:    response.ExitStatus,
		Timestamp: protobuf.FromTimestamp(response.ExitedAt),
		Pid:       response.Pid,
	}, nil
}
```

If it's not accepted to send the same event multiple times, I think we should implement the event system in the runtime v2 plugin instead of the shim server.
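Separately from the event question, here is a minimal sketch of the caller-side retry that the retriable delete above would enable. The deleteTask function type, the per-attempt timeout, and the retry predicate are illustrative assumptions, not existing containerd code.

```go
// Sketch only: a caller-side retry loop, assuming the delete path above is
// made idempotent. Names and timeouts here are illustrative.
package shimretry

import (
	"context"
	"errors"
	"time"
)

// deleteTask stands in for whatever ends up invoking shimTask.delete.
type deleteTask func(ctx context.Context) error

// retryDelete keeps calling del with a fresh per-attempt timeout until it
// succeeds, hits a non-transient error, or the outer context is done. Only
// cancellation/deadline errors (e.g. from a slow Shutdown) are retried.
func retryDelete(ctx context.Context, del deleteTask) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		err := del(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		if !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```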
Additional context
I think moby/moby should not rely on events to update the task status. moby should have a relist function to ensure the status doesn't become out of date.
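As an illustration of the relist idea, here is a minimal sketch using the containerd Go client. The statusStore interface and the reconciliation policy are assumptions standing in for moby's internal state, not moby code.

```go
// Sketch only: a periodic relist pass that reconciles a client-side cache
// with containerd's view of task status. statusStore is hypothetical.
package relist

import (
	"context"

	containerd "github.com/containerd/containerd"
)

type statusStore interface {
	UpdateStatus(id string, status containerd.Status)
}

// relist expects ctx to already carry a containerd namespace.
func relist(ctx context.Context, client *containerd.Client, store statusStore) error {
	containers, err := client.Containers(ctx)
	if err != nil {
		return err
	}
	for _, c := range containers {
		task, err := c.Task(ctx, nil)
		if err != nil {
			// No task (never started or already deleted): nothing to reconcile.
			continue
		}
		status, err := task.Status(ctx)
		if err != nil {
			continue
		}
		store.UpdateStatus(c.ID(), status)
	}
	return nil
}
```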
And IMO, the shim.Wait API call is better than the task events. Based on task.Wait we don't miss any exits; the CRI plugin works this way, for instance.
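For comparison, a minimal sketch of blocking on task.Wait via the containerd Go client instead of subscribing to exit events. The socket path, namespace, and container ID below are placeholders.

```go
// Sketch only: getting the exit status directly from task.Wait rather than
// relying on a TaskExit event. Socket path, namespace and container ID are
// placeholders.
package main

import (
	"context"
	"fmt"
	"log"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")

	container, err := client.LoadContainer(ctx, "my-container")
	if err != nil {
		log.Fatal(err)
	}
	task, err := container.Task(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Register the wait before doing anything else so the exit can't be missed.
	statusC, err := task.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}

	status := <-statusC
	code, exitedAt, err := status.Result()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("task exited with %d at %s\n", code, exitedAt)
}
```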