
removeOldestN doesn't make sure the container is deleted completely #130331

@ningmingxiao

Description


What happened?

func (cgc *containerGC) removeOldestN(ctx context.Context, containers []containerGCInfo, toRemove int) []containerGCInfo {
	// Remove from oldest to newest (last to first).
	numToKeep := len(containers) - toRemove
	if numToKeep > 0 {
		sort.Sort(byCreated(containers))
	}
	for i := len(containers) - 1; i >= numToKeep; i-- {
		if containers[i].unknown {
			// Containers in an unknown state could still be running, so try
			// to stop them before removal.
			id := kubecontainer.ContainerID{
				Type: cgc.manager.runtimeName,
				ID:   containers[i].id,
			}
			message := "Container is in unknown state, try killing it before removal"
			if err := cgc.manager.killContainer(ctx, nil, id, containers[i].name, message, reasonUnknown, nil, nil); err != nil {
				klog.ErrorS(err, "Failed to stop container", "containerID", containers[i].id)
				continue
			}
		}
		if err := cgc.manager.removeContainer(ctx, containers[i].id); err != nil {
			klog.ErrorS(err, "Failed to remove container", "containerID", containers[i].id)
		}
	}

	// Assume we removed the containers so that we're not too aggressive.
	return containers[:numToKeep]
}
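
Note the failure path in the function above: when killContainer or removeContainer returns an error, the error is only logged, the loop moves on, and the function still returns containers[:numToKeep], so the container is treated as removed even though it may still exist in the runtime. Below is a minimal, illustrative sketch of a retry-and-verify helper; the removeFn/existsFn callbacks, attempt count, and backoff are assumptions for illustration, not kubelet's actual interfaces.

package gcsketch

import (
	"context"
	"fmt"
	"time"
)

// removeFn and existsFn stand in for the runtime (CRI) calls; both
// signatures are assumptions for illustration only.
type removeFn func(ctx context.Context, id string) error
type existsFn func(ctx context.Context, id string) (bool, error)

// removeUntilGone retries removal until the runtime confirms the container
// is gone, instead of assuming the first attempt succeeded.
func removeUntilGone(ctx context.Context, id string, remove removeFn, exists existsFn, attempts int, backoff time.Duration) error {
	for i := 0; i < attempts; i++ {
		if err := remove(ctx, id); err != nil {
			fmt.Printf("remove attempt %d for container %s failed: %v\n", i+1, id, err)
		}
		if present, err := exists(ctx, id); err == nil && !present {
			return nil // removal confirmed by the runtime
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return fmt.Errorf("container %s still present after %d removal attempts", id, attempts)
}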

Many log lines like the following appear:

E0210 19:52:56.897622   38171 remote_runtime.go:347] "StartContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa"
I0210 19:52:57.303537   38171 scope.go:143] "RemoveContainer" containerID="29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa"
I0210 21:37:42.650459   36915 scope.go:143] "RemoveContainer" containerID="29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa"
E0210 21:37:42.650790   36915 remote_runtime.go:439] "ContainerStatus from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa"
E0210 21:37:42.650804   36915 kuberuntime_gc.go:151] "Failed to remove container" err="failed to get container status \"29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\"" containerID="29d1d59735a2eb389ca29f20bc85d002f90dd509ccd0c1c16c5506e07184cfaa

This can cause containerd to panic because the process exceeds the 10000-thread limit; see containerd/containerd#11389.

What did you expect to happen?

The kubelet should keep retrying the deletion until the container is actually removed from the runtime.
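
One hedged sketch of what "keep retrying until it is really gone" could look like at the call site (continuing the illustrative gcsketch package above, with plain string IDs instead of containerGCInfo): only containers whose removal succeeded are dropped from the tracked list, so anything that failed is retried on the next garbage-collection pass.

// Sketch only: partition IDs into confirmed removals and leftovers that the
// caller should keep tracking and retry on the next garbage-collection pass.
func removeConfirmed(ctx context.Context, ids []string, remove removeFn) (removed, remaining []string) {
	for _, id := range ids {
		if err := remove(ctx, id); err != nil {
			// Keep the container around so a later pass retries it.
			remaining = append(remaining, id)
			continue
		}
		removed = append(removed, id)
	}
	return removed, remaining
}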

How can we reproduce it (as minimally and precisely as possible)?

Difficult to reproduce.

Anything else we need to know?

It happens when the system is busy (under heavy load).


    Labels

    kind/bug: Categorizes issue or PR as related to a bug.
    lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
    sig/node: Categorizes an issue or PR as relevant to SIG Node.
    triage/needs-information: Indicates an issue needs more information in order to work on it.
