
cri: close leak container's io when restart containerd #11389

Closed

ningmingxiao wants to merge 1 commit into containerd:main from ningmingxiao:leak_io2

Conversation

@ningmingxiao (Contributor) commented Feb 14, 2025

If the system is busy, the shim is busy and the container status is unknown, so k8s creates many containers in one pod, which causes containerd to panic when containerd is restarted (we raised debug.SetMaxThreads from the default 10000 to 20000 and it still panics):

Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime: program exceeds 20000-thread limit
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: fatal error: thread exhaustion
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime stack:
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime.throw({0x1a432b2?, 0x7fb3545f2f78?})
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7fb3545f2f38 sp=0x7fb3545f2f08 pc=0x439e5d
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime.checkmcount()
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/runtime/proc.go:790 +0x8c fp=0x7fb3545f2f60 sp=0x7fb3545f2f38 pc=0x43dd8c
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime.mReserveID()
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/runtime/proc.go:806 +0x36 fp=0x7fb3545f2f88 sp=0x7fb3545f2f60 pc=0x43ddd6

$ cat 1.log | grep openFifo | wc -l
819

There are many openFifo goroutines in the panic stack dump:

Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: syscall.openat(0x17436c0?, {0xc00c4313c8?, 0xc0010e8010?}, 0x9?, 0x0)
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/syscall/zsyscall_linux_amd64.go:83 +0x94 fp=0xc010771e80 sp=0xc010771e08 pc=0x4c7914
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: syscall.Open(...)
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/syscall/syscall_linux.go:272
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: os.openFileNolog({0xc00c4313c8, 0x12}, 0x0, 0x0)
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/os/file_unix.go:245 +0x9b fp=0xc010771ec8 sp=0xc010771e80 pc=0x4f629b
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: os.OpenFile({0xc00c4313c8, 0x12}, 0x0, 0x10ee1b0?)
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/os/file.go:326 +0x45 fp=0xc010771f00 sp=0xc010771ec8 pc=0x4f3f45
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: github.com/containerd/fifo.openFifo.func2()
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/root/rpmbuild/BUILD/containerd.io-1.7.6/_build/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:138 +0xc5 fp=0xc010771fe0 sp=0xc010771f00 pc=0xd50a85
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: runtime.goexit()
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc010771fe8 sp=0xc010771fe0 pc=0x471721
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: created by github.com/containerd/fifo.openFifo
Feb 10 20:34:09 paas-controller-0-0 containerd[22730]: #011/root/rpmbuild/BUILD/containerd.io-1.7.6/_build/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:131 +0x3be
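
For context, a minimal, self-contained sketch of the suspected mechanism (a reconstruction, not containerd code): opening a FIFO read-only when no writer exists blocks inside open(2); Go parks each blocked open on its own OS thread, so enough leaked FIFO opens trip the debug.SetMaxThreads limit, matching the stack traces above.

```go
// Hypothetical repro sketch (not containerd code): every goroutine that
// blocks in open(2) on a FIFO with no writer pins one OS thread, so leaked
// container FIFOs eventually exhaust the runtime thread limit.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/debug"

	"golang.org/x/sys/unix"
)

func main() {
	debug.SetMaxThreads(50) // small limit so the failure shows quickly

	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	for i := 0; ; i++ {
		p := filepath.Join(dir, fmt.Sprintf("fifo-%d", i))
		if err := unix.Mkfifo(p, 0o600); err != nil {
			panic(err)
		}
		go func(p string) {
			// With no writer on the other end, this open(2) never
			// returns; the goroutine sits in a syscall on its own
			// OS thread, just like the openFifo goroutines above.
			f, err := os.OpenFile(p, os.O_RDONLY, 0)
			if err == nil {
				f.Close()
			}
		}(p)
	}
	// Dies with: "runtime: program exceeds 50-thread limit" /
	// "fatal error: thread exhaustion", the same crash as in the logs.
}
```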

This may fix #9113 and #10515,

but I can't find a good way to understand how the "... is reserved for ..." error happens.

The panic happened (20:34:09) after containerd was restarted (20:03:44).

Feb 10 20:03:44 paas-controller-0-0 containerd[22730]: time="2025-02-10T20:03:44.251063233+08:00" level=info msg="containerd successfully booted in 3.622669s"

@k8s-ci-robot

Hi @ningmingxiao. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


@ningmingxiao changed the title from "close leak container's io" to "cri:close leak container's io when restart containerd" Feb 14, 2025
@ningmingxiao force-pushed the leak_io2 branch 2 times, most recently from 76b89f0 to 31244a1, February 14, 2025 14:25
@fuweid (Member) left a comment
Please use defer to close IO
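
A rough sketch of the shape this suggestion implies (all type and helper names here are placeholders, not the actual internal/cri/server/restart.go code):

```go
// Hypothetical sketch of the "defer to close IO" suggestion; the types and
// helpers here are placeholders, not the actual internal/cri/server code.
package main

import (
	"errors"
	"fmt"
	"os"
)

type containerIO struct{ stdout, stderr *os.File }

func (c *containerIO) Close() {
	if c.stdout != nil {
		c.stdout.Close()
	}
	if c.stderr != nil {
		c.stderr.Close()
	}
}

func loadContainerIO(id string) (*containerIO, error) { return &containerIO{}, nil }
func reserveName(id string) error                     { return errors.New("name is reserved") }
func addToStore(id string, cio *containerIO) error    { return nil }

// recoverContainer shows the pattern: one deferred cleanup covers every
// failure path, instead of closing the IO by hand before each early return.
func recoverContainer(id string) (retErr error) {
	cio, err := loadContainerIO(id)
	if err != nil {
		return err
	}
	defer func() {
		if retErr != nil {
			cio.Close() // close recovered IO on any failure below
		}
	}()

	if err := reserveName(id); err != nil {
		return err // the defer releases the FIFOs, so they don't leak
	}
	return addToStore(id, cio)
}

func main() { fmt.Println(recoverContainer("demo")) }
```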

Comment thread internal/cri/server/restart.go Outdated
Comment thread internal/cri/server/restart.go Outdated
@fuweid previously approved these changes Feb 18, 2025
@fuweid (Member) left a comment
LGTM on green

@ningmingxiao (Contributor, Author)

Can this PR be merged? @fuweid

@fuweid changed the title from "cri:close leak container's io when restart containerd" to "cri: close leak container's io when restart containerd" Feb 18, 2025
@fuweid (Member) commented Feb 18, 2025

Can this PR be merged? @fuweid

The merge rule is to have at least two approvals. :) Please wait for reviewers. Thanks

@ningmingxiao (Contributor, Author)

cc @mikebrow

Comment thread internal/cri/server/restart.go
@fuweid dismissed their stale review February 18, 2025 19:17

we need to handle rollback reservation first
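
Presumably this means something like the following sketch (placeholder registrar and store, not containerd's real types): release the just-reserved name on any later failure so retries don't hit the "name ... is reserved for ..." error.

```go
// Hypothetical sketch of rolling back a name reservation; registrar and
// addToStore are placeholders, not containerd's actual types.
package main

import (
	"fmt"
	"sync"
)

type registrar struct {
	mu    sync.Mutex
	names map[string]string // name -> container ID
}

func (r *registrar) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.names[name]; ok && owner != id {
		return fmt.Errorf("name %q is reserved for %q", name, owner)
	}
	r.names[name] = id
	return nil
}

func (r *registrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

var nameIndex = &registrar{names: map[string]string{}}

func addToStore(id string) error { return fmt.Errorf("store add failed for %q", id) }

func restoreContainer(name, id string) (retErr error) {
	if err := nameIndex.Reserve(name, id); err != nil {
		return err
	}
	// Roll back the reservation on any later failure, so a retry does
	// not fail with "name ... is reserved for ...".
	defer func() {
		if retErr != nil {
			nameIndex.Release(name)
		}
	}()
	return addToStore(id)
}

func main() {
	fmt.Println(restoreContainer("nginx_web_default_uid_0", "c1"))
	// The rollback means a second attempt can reserve the name again:
	fmt.Println(restoreContainer("nginx_web_default_uid_0", "c2"))
}
```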

@ningmingxiao force-pushed the leak_io2 branch 6 times, most recently from dbe79a8 to 27a6956, February 20, 2025 02:20
@ningmingxiao (Contributor, Author) commented Feb 20, 2025

Two questions:

1. Do we need to reserve only the container with the biggest attempt number? (See the worked example after this list.)

```go
func makeContainerName(c *runtime.ContainerMetadata, s *runtime.PodSandboxMetadata) string {
	return strings.Join([]string{
		c.Name,      // 0: container name
		s.Name,      // 1: pod name
		s.Namespace, // 2: pod namespace
		s.Uid,       // 3: pod uid
		strconv.FormatUint(uint64(c.Attempt), 10), // 4: attempt number of creating the container
	}, nameDelimiter)
}
```

2. If we let Reserve run before Add, an old k8s version has a chance to see the conflicting containers, but a new k8s version won't see them (since they are not added into the store), so containerd has to delete them.
   @fuweid @mikebrow
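
For illustration (assuming nameDelimiter is "_", which this snippet does not show): three creation attempts of container nginx in pod web (namespace default, uid abc) would yield the names nginx_web_default_abc_0, nginx_web_default_abc_1, and nginx_web_default_abc_2. Reserving only the biggest attempt number would skip the earlier names, which can still belong to valid, exited containers.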

Comment thread internal/cri/server/restart.go
Comment thread internal/cri/server/restart.go Outdated
Comment thread internal/cri/server/restart.go Outdated
@fuweid (Member) commented Feb 20, 2025

  1. Do we need to reserve only the container with the biggest attempt number?

No, we don't. Old, exited containers may still exist, and they are valid.

  2. If we let Reserve run before Add, an old k8s version has a chance to see the conflicting containers, but a new k8s version won't see them (since they are not added into the store), so containerd has to delete them.

If I understand correctly, you're saying that ListContainers can return all containers, including those with the same name. Yes, that's correct. With my suggestion, ListContainers will not return leaked containers that have duplicate names, and then we can't close their IO.

I don't know how to reproduce containers with the same attempt number in the k8s case.
I think we should hold this pull request until we have more detail on that.

/hold

@ningmingxiao (Contributor, Author) commented Feb 20, 2025

I found many "failed to reserve container name" log entries after containerd booted successfully, so it may not have happened during recovery.

The panic happened (20:34:09) after containerd was restarted (20:03:44).

Feb 10 20:03:44 paas-controller-0-0 containerd[22730]: time="2025-02-10T20:03:44.251063233+08:00" level=info msg="containerd successfully booted in 3.622669s"

The panic reason is that containerd doesn't receive delete requests; many containers are waiting to be started.

$ lsof -p $(pidof containerd)

container 628887 root *071u     FIFO               0,25       0t0   6956077 /run/containerd/io.containerd.grpc.v1.cri/containers/2834d95a0a5126a88aad328d03bea35f793c4d474a7f46fe148948f25ec7c1e6/io/704010651/2834d95a0a5126a88aad328d03bea35f793c4d474a7f46fe148948f25ec7c1e6-stdout
container 628887 root *075u     FIFO               0,25       0t0   6956080 /run/containerd/io.containerd.grpc.v1.cri/containers/3469913a25af73297aaa4c0da5dca4ad7cd14d52c63964c093299a321fd703dd/io/2614958862/3469913a25af73297aaa4c0da5dca4ad7cd14d52c63964c093299a321fd703dd-stdout
container 628887 root *078u     FIFO               0,25       0t0   6956083 /run/containerd/io.containerd.grpc.v1.cri/containers/24dec3be7e6cf5e061bef5c00025bb37295e654a8bfca08f3cdda6af0f60e9c3/io/2076536548/24dec3be7e6cf5e061bef5c00025bb37295e654a8bfca08f3cdda6af0f60e9c3-stdout

10124 stdout and stderr FIFOs are held open by containerd:

$ cat lsof.log | grep stdout | wc -l
10124
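
As a cross-check without lsof, a hypothetical helper (not part of this PR) can count the stdout FIFOs a process still holds by reading /proc/<pid>/fd, mirroring the grep pipeline above:

```go
// Hypothetical helper mirroring `lsof | grep stdout | wc -l`: count CRI
// stdout FIFOs still held open by a containerd process.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func countStdoutFifos(pid int) (int, error) {
	fdDir := fmt.Sprintf("/proc/%d/fd", pid)
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		return 0, err // reading another process's fds typically needs root
	}
	n := 0
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue // the fd may have closed while we were scanning
		}
		if strings.HasSuffix(target, "-stdout") {
			n++
		}
	}
	return n, nil
}

func main() {
	// Placeholder PID; in practice pass the containerd PID from pidof.
	n, err := countStdoutFifos(1)
	fmt.Println(n, err)
}
```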

@fuweid (Member) commented Feb 21, 2025

The panic reason is that containerd doesn't receive delete requests; many containers are waiting to be started.

Is it caused by an official kubelet release? I mean, not your internal version.

For the "failed to reserve container name" error during recovery, issue #7247 explains one case.

@ningmingxiao (Contributor, Author)

We use an internal k8s maintained by another team: kubernetes/kubernetes#130331

@k8s-ci-robot

PR needs rebase.


@github-actions

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

@github-actions Bot added the Stale label Jun 30, 2025
@github-actions Bot commented Jul 7, 2025

This PR was closed because it has been stalled for 7 days with no activity.

@github-actions Bot closed this Jul 7, 2025
@github-project-automation Bot moved this from Needs Update to Done in Pull Request Review Jul 7, 2025
@thameezb

Can this be reopened?
We are hitting this issue quite frequently now.



Development

Successfully merging this pull request may close these issues.

failed to reserve container name- increase in iops didn't worked

5 participants