Skip to content

enhancement: make task.Delete API retriable #8981

@fuweid

Description

@fuweid

What is the problem you're trying to solve

In current design, the containerd runtime v2 task.Delete API will ignore the Shutdown error, for instance context.Canceled or context.DeadlineExceed, and then delete shim instance from memory. If the shim doesn't receive the Shutdown call, the shim won't exit by itself. And then it becomes leaky shim.

// Don't shutdown sandbox as there may be other containers running.
// Let controller decide when to shutdown.
if !sandboxed {
if err := s.waitShutdown(ctx); err != nil {
// FIXME(fuweid):
//
// If the error is context canceled, should we use context.TODO()
// to wait for it?
log.G(ctx).WithField("id", s.ID()).WithError(err).Error("failed to shutdown shim task and the shim might be leaked")
}
}

I uses Shutdown failpoint to reproduce this issue in #8954, please checkout the code

func TestIssue7496_ShouldRetryShutdown(t *testing.T) {
.

I want to fix the leaky shim issue from API design, instead of adding easy-break patches. The key I think is to make the task.Delete API retriable.

Related issues:

Describe the solution you'd like

Before we make the task.Delete API retriable, we should rethink about the task event and runc-shim delete binary call.

Currently, both task.Exit and task.Delete can be sent by shim or containerd runtime v2 plugin.
For containerd runtime v2 plugin, it doesn't know whether the shim has sent it or not.
So, the runtime v2 plugin will send the event to subscriber (like moby/moby) only if the shim.Delete API call is successful.

func cleanupAfterDeadShim(ctx context.Context, id string, rt *runtime.NSMap[ShimInstance], events *exchange.Exchange, binaryCall *binary) {
ctx, cancel := timeout.WithContext(ctx, cleanupTimeout)
defer cancel()
log.G(ctx).WithField("id", id).Warn("cleaning up after shim disconnected")
response, err := binaryCall.Delete(ctx)
if err != nil {
log.G(ctx).WithError(err).WithField("id", id).Warn("failed to clean up after shim disconnected")
}
if _, err := rt.Get(ctx, id); err != nil {
// Task was never started or was already successfully deleted
// No need to publish events
return
}

And the runc-shim doesn't persist the exit code of container in bundle dir. The runc-shim always use 137 as exit code even if the container exit code is zero. For example, #4769.

I think we should define the shim.Delete binary call with more detail in https://github.com/containerd/containerd/blob/main/runtime/v2/README.md#delete: The shim implementation should ensure that the exit code value from shim.Delete binary call should be aligned with shim.Delete API call.

There is one question we need to confirm:

  • Is it allowed to send the same event multiple times?

If so and the shim.Delete binary call can returns the real exit code, the retriable task.Delete will be like:

func (s *shimTask) delete(ctx context.Context, sandboxed bool, removeTask func(ctx context.Context, id string)) (*runtime.Exit, error) {
	response, shimErr := s.task.Delete(ctx, &task.DeleteRequest{ID: s.ID() })
	if shimErr != nil {
		log.G(ctx).WithField("id", s.ID()).WithError(shimErr).Debug("failed to delete task")
		if !errors.Is(shimErr, ttrpc.ErrClosed) {
			shimErr = errdefs.FromGRPC(shimErr)
			if !errdefs.IsNotFound(shimErr) {
				return nil, shimErr
			}
		}
	}

	
	if err := s.waitShutdown(ctx); err != nil {
		return err
	}

	if err := s.ShimInstance.Delete(ctx); err != nil {
		return err
	}

        // send the exit and delete events
	
	removeTask(ctx, s.ID())

	return &runtime.Exit{
		Status:    response.ExitStatus,
		Timestamp: protobuf.FromTimestamp(response.ExitedAt),
		Pid:       response.Pid,
	}, nil
}

If it's not accecpted to send the same event multiple time, I think we should implement the event system in runtime v2, instead of shim server.

Additional context

I think the moby/moby should not rely on the event to change task status. moby should have relist function to ensure the status is not out-of-date.

And IMO, the shim.Wait API call is better than task event. We don't miss any event based on task.Wait, for instance, CRI plugin.

moby/moby#46212

Metadata

Metadata

Assignees

No one assigned

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions