Fix Plan: failed to recover state: failed to reserve container name xxx: name xxx is reserved for xxx #11504

Description

@fuweid

Background

Reported in #7247

The CRI plugin always creates a checkpoint for each container and sandbox status. Checkpoints are committed using atomic writes, ensuring the JSON file is never left half-written. However, if the filesystem has run out of space during a containerd restart, the checkpoint write fails and the CRI plugin skips the existing containers. Even though the task or shim plugin is still managing these running containers, the CRI plugin renders them invisible to the kubelet.
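
For reference, atomic writers like continuity.AtomicWriteFile follow the write-to-a-temporary-file-then-rename pattern, which is why an out-of-space failure leaves the old checkpoint intact rather than producing a torn file. A simplified sketch (fsync and temp-file cleanup are omitted for brevity):

// Simplified sketch of the write-temp-then-rename pattern used by
// atomic file writers such as continuity.AtomicWriteFile.
package main

import "os"

func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	tmp := path + ".tmp" // real implementations use a unique temp name
	if err := os.WriteFile(tmp, data, perm); err != nil {
		return err // e.g. ENOSPC surfaces here, before the rename
	}
	// rename(2) is atomic on POSIX filesystems: readers observe either
	// the old complete file or the new one, never a partial write.
	return os.Rename(tmp, path)
}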

Once the disk is healthy and space becomes available, the kubelet sends a request to containerd, prompting it to create a new sandbox or container with the existing attempt number or restartCount. Although the patch in k8s#99748 ensures that restartCount is incremented, running two duplicate containers within a single pod is still not expected.

Fix plan

Short-Term

Introduce a new helper, such as WithStatusWhenRestart, that tolerates the no-space error.
This helper should be used in restart.go only.

+func WithStatusWhenRestart(status Status, root string) Opts {
+       return func(c *Container) error {
+               s, err := StoreStatusWhenRestart(root, status)
+               if err != nil {
+                       return err
+               }
+               c.Status = s
+               if s.Get().State() == runtime.ContainerState_CONTAINER_EXITED {
+                       c.Stop()
+               }
+               return nil
+       }
+}
+
+func StoreStatusWhenRestart(root string, status Status) (StatusStorage, error) {
+       data, err := status.encode()
+       if err != nil {
+               return nil, fmt.Errorf("failed to encode status: %w", err)
+       }
+       path := filepath.Join(root, "status")
+       if err := continuity.AtomicWriteFile(path, data, 0600); err != nil {
+               // ENOSPC detection may need platform-specific handling,
+               // e.g. the equivalent error on Windows.
+               if !errors.Is(err, syscall.ENOSPC) {
+                       return nil, fmt.Errorf("failed to checkpoint status to %q: %w", path, err)
+               }
+               // Tolerate the no-space error: log it and fall through so
+               // recovery can still proceed with the in-memory status.
+       }
+       return &statusStorage{
+               path:   path,
+               status: status,
+       }, nil
+}
+
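
For illustration, the loading path in restart.go could then apply the tolerant helper instead of the strict WithStatus. A minimal sketch in the same package as the helper above; recoverContainerStatus is a hypothetical wrapper, not the actual containerd code:

// Hypothetical recovery snippet (simplified from restart.go's loading
// path): WithStatusWhenRestart tolerates ENOSPC, so a full disk no
// longer causes recovery to skip the container and hide it from the
// kubelet.
func recoverContainerStatus(c *Container, statusDir string, status Status) error {
	if err := WithStatusWhenRestart(status, statusDir)(c); err != nil {
		return fmt.Errorf("failed to recover container status: %w", err)
	}
	return nil
}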

Long-Term

The checkpoint file is used to track whether a container has been started or stopped.
For instance, once container A has exited, the CRI plugin deletes the task via the shim API and stores the exit code in the checkpoint file.
If there is no checkpoint file and no other record of the exit code, then after restarting containerd, container A could be regarded as Created instead of Exited, and the kubelet would try to restart it instead of deleting it.

We don't actually need an extra checkpoint, because we already have the metadata plugin.
We can consider storing the status as an annotation on the container in the metadata store.
That way, all the required information lives with the container record, and it is easy to reconstruct whatever status we need.
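
A rough sketch of this direction using the containerd client API; the label key and the criStatus type are illustrative assumptions (a typed container extension would work similarly):

// Sketch: persist CRI status on the container record itself, so a
// restart can recover it from the metadata store without a checkpoint
// file. The label key "io.cri-containerd/status" is hypothetical.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	containerd "github.com/containerd/containerd"
)

// criStatus stands in for the CRI plugin's status; the real type
// carries more fields (reason, message, timestamps, ...).
type criStatus struct {
	StartedAt  int64 `json:"startedAt"`
	FinishedAt int64 `json:"finishedAt"`
	ExitCode   int32 `json:"exitCode"`
}

// persistStatus writes the status into the metadata store alongside
// the container record.
func persistStatus(ctx context.Context, c containerd.Container, s criStatus) error {
	data, err := json.Marshal(s)
	if err != nil {
		return fmt.Errorf("failed to encode status: %w", err)
	}
	_, err = c.SetLabels(ctx, map[string]string{"io.cri-containerd/status": string(data)})
	return err
}

// recoverStatus reconstructs the status during containerd restart.
func recoverStatus(ctx context.Context, c containerd.Container) (criStatus, error) {
	var s criStatus
	labels, err := c.Labels(ctx)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal([]byte(labels["io.cri-containerd/status"]), &s)
	return s, err
}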

Test Plan - E2E Test

  • Set up and run a containerd process on a 100 MiB filesystem (backed by a loopback device)
  • Create three containers in one pod:
      • A: Created state
      • B: Running state
      • C: Exited state
  • Stop containerd
  • Inject a failpoint so the filesystem runs out of space
  • Restart containerd
  • Check the status of containers A, B, and C

We should commit the E2E test case first and then submit the fix patch; a skeleton of the test follows.
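
A skeletal sketch of such a test, talking to the CRI runtime service over containerd's socket. The socket path, container IDs, and the restartContainerd helper are placeholders, not containerd's actual integration-test harness:

// Skeleton of the E2E test; containerA/B/C and restartContainerd are
// placeholders to be provided by the real test harness.
package integration

import (
	"context"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// IDs to be replaced with the real container IDs from the setup phase.
var containerA, containerB, containerC = "A", "B", "C"

// restartContainerd would stop containerd, inject the ENOSPC failpoint
// on the 100 MiB loopback filesystem, and start containerd again.
func restartContainerd(t *testing.T) { t.Helper() /* ... */ }

func TestRecoverStateWhenDiskIsFull(t *testing.T) {
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()
	client := runtime.NewRuntimeServiceClient(conn)

	restartContainerd(t)

	// All three containers must stay visible with their original states.
	expected := map[string]runtime.ContainerState{
		containerA: runtime.ContainerState_CONTAINER_CREATED,
		containerB: runtime.ContainerState_CONTAINER_RUNNING,
		containerC: runtime.ContainerState_CONTAINER_EXITED,
	}
	for id, want := range expected {
		resp, err := client.ContainerStatus(context.Background(),
			&runtime.ContainerStatusRequest{ContainerId: id})
		if err != nil {
			t.Fatalf("status for %q: %v", id, err)
		}
		if got := resp.GetStatus().GetState(); got != want {
			t.Errorf("container %q: state = %v, want %v", id, got, want)
		}
	}
}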

NOTE

cc @mikebrow @dmcgowan
ping @yylt
