*: properly shutdown non-groupable shims to prevent resource leaks by fuweid · Pull Request #11916 · containerd/containerd

fuweid · 2025-05-30T04:10:25Z

Previously, to address issue #11708, PR #11793 changed containerd to always
invoke the shim binary to establish shim connections, rather than reusing the
sandbox shim. However, this change did not ensure that the Shutdown API was
called to stop the shim process.

Starting with containerd v2.0.0, the Shutdown API is only invoked for sandbox
containers (when container.SandboxID is empty). This approach works for
groupable shims, where multiple containers share a single socket address and
only require a single Shutdown call. However, for non-groupable shims, each
container requires its own Shutdown call during cleanup to avoid leaking shim
processes.
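
As a rough illustration of the distinction above, the following Go sketch decides whether cleanup should issue its own Shutdown call. It is not the code from this PR: the `shimInfo` type, the `needsShutdown` helper, and the idea of comparing socket addresses are assumptions made for the example, based only on the statement that groupable shims share a single socket address.

```go
package main

import "fmt"

// shimInfo is a hypothetical, simplified view of a shim during cleanup.
type shimInfo struct {
	ContainerID string
	SandboxID   string // empty for the sandbox container itself
	Address     string // socket address of the shim serving this container
}

// needsShutdown reports whether cleanup should call Shutdown for c.
// sandboxAddr maps a sandbox ID to the socket address of its sandbox shim.
// Assumption: a container whose shim shares the sandbox shim's address is
// "groupable" and is covered by the sandbox's single Shutdown call.
func needsShutdown(c shimInfo, sandboxAddr map[string]string) bool {
	if c.SandboxID == "" {
		// Sandbox container: its shim is always shut down.
		return true
	}
	addr, ok := sandboxAddr[c.SandboxID]
	if !ok {
		// No known sandbox shim, so nothing else will stop this shim.
		return true
	}
	// Non-groupable shims listen on their own address and need their own call.
	return addr != c.Address
}

func main() {
	sandboxAddr := map[string]string{"pod-a": "unix:///run/shim/pod-a.sock"}
	containers := []shimInfo{
		{ContainerID: "pod-a", Address: "unix:///run/shim/pod-a.sock"},
		{ContainerID: "c1", SandboxID: "pod-a", Address: "unix:///run/shim/pod-a.sock"},
		{ContainerID: "c2", SandboxID: "pod-a", Address: "unix:///run/shim/c2.sock"},
	}
	for _, c := range containers {
		fmt.Printf("%s needs Shutdown: %v\n", c.ContainerID, needsShutdown(c, sandboxAddr))
	}
}
```

In this sketch only `c1` skips its own Shutdown call, because it shares the sandbox shim's socket; `c2` runs in a separate shim process that would otherwise leak.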

Additionally, PR #11793 introduced a corner case during upgrades:

T1: An old containerd-shim-runc-v2 (<=v1.7.X) is running for pod A.
T2: containerd is upgraded to v2.X.Y.
T3: A new container A-C1 is created in pod A using the new shim-runc-v2 binary.
T4: bootstrap.json indicates version:3 protocol, but it is downgraded to version:2 in memory.
T5: containerd is restarted.
T6: containerd fails to connect to A-C1.
T7: The A-C1 container is left in EXITED status in the CRI plugin.

To address this, ensure that loadShimTask downgrades to version:2 if necessary,
and always invoke the Shutdown API for each non-groupable shim during cleanup to
prevent resource leaks and handle upgrade scenarios correctly.

(Introduced by #11793)

Signed-off-by: Wei Fu [email protected]

Fixes: #11871

fuweid · 2025-05-30T04:11:45Z

ping @smira if possible, would you please verify this patch? thanks
(I think CI now covers the shim leak, but only with shim-runc-v1. I just want to double-check that this patch also works with gvisor.)

Update: I tested with gvisor locally, and the patch fixes that issue.


fuweid · 2025-06-02T20:01:58Z

/retest

smira · 2025-06-03T16:49:13Z

> ping @smira if possible, would you please verify this patch? thanks

I verified by adding https://patch-diff.githubusercontent.com/raw/containerd/containerd/pull/11916.patch and removing my previous hacky patch https://github.com/siderolabs/pkgs/blob/main/containerd/patches/11741-not-set-sandbox-id-when-use-podsandbox-type.patch - all seems to be good in the tests, thank you!

fuweid · 2025-06-03T17:57:07Z

@smira thanks for the update!

djdongjin · 2025-06-07T23:29:42Z

 		removeTask(ctx, s.ID())
 	}

+	const supportSandboxAPIVersion = 3


nit: should we move this to a package-level const, with a comment about if/when to bump it? (Also, do we already have an existing const for the current API version?)

It is a fixed value. I think we only need to change the sandbox value for old shims. I will add a comment in a follow-up. Thanks for the suggestion.
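
For readers following this thread, here is a minimal Go sketch of what a package-level constant with a comment could look like, together with one reading of "change the sandbox value for old shims". The `shouldShutdownOnDelete` helper and the doc comment wording are assumptions for illustration only; the merged code may differ.

```go
package main

import "fmt"

// supportSandboxAPIVersion is the lowest shim protocol version that this
// sketch treats as supporting sandbox-managed shutdown. Assumption: shims
// below this version must be shut down per container.
const supportSandboxAPIVersion = 3

// shouldShutdownOnDelete reports whether deleting a task should also call
// Shutdown on its shim. For shims older than supportSandboxAPIVersion the
// sandboxed flag is ignored, mirroring the idea of overriding the sandbox
// value for old shims.
func shouldShutdownOnDelete(sandboxed bool, shimVersion int) bool {
	if shimVersion < supportSandboxAPIVersion {
		sandboxed = false
	}
	return !sandboxed
}

func main() {
	fmt.Println(shouldShutdownOnDelete(true, 2))
	fmt.Println(shouldShutdownOnDelete(true, 3))
	fmt.Println(shouldShutdownOnDelete(false, 3))
}
```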

fuweid · 2025-06-10T18:27:32Z

/cherry-pick release/2.1

k8s-infra-cherrypick-robot · 2025-06-10T18:28:16Z

@fuweid: new pull request created: #11971


Drop the hacky patch which was a workaround for gvisor issues. I tested the fix in 2.1.2: containerd/containerd#11916 (comment) Signed-off-by: Andrey Smirnov <[email protected]>

github-project-automation Bot added this to Pull Request Review May 30, 2025

k8s-ci-robot added the do-not-merge/work-in-progress label May 30, 2025

github-project-automation Bot moved this to Needs Triage in Pull Request Review May 30, 2025

k8s-ci-robot added the size/L label May 30, 2025

dosubot Bot added the area/runtime label May 30, 2025

fuweid force-pushed the fix-11871 branch from 9834179 to 25626dd Compare June 2, 2025 03:55

fuweid changed the title ~~[WIP] Fix 11871~~ *: properly shutdown non-groupable shims to prevent resource leaks Jun 2, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 2, 2025

fuweid requested review from dmcgowan, mxpv and samuelkarp June 2, 2025 03:57

fuweid force-pushed the fix-11871 branch from 25626dd to 1ac97c2 Compare June 2, 2025 04:01

fuweid added the ok-to-test, cherry-pick/2.0.x, and cherry-pick/2.1.x labels Jun 2, 2025

AkihiroSuda approved these changes Jun 7, 2025


djdongjin approved these changes Jun 7, 2025


github-project-automation Bot moved this from Needs Triage to Review In Progress in Pull Request Review Jun 7, 2025

fuweid added this pull request to the merge queue Jun 7, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 8, 2025

fuweid added this pull request to the merge queue Jun 8, 2025

Merged via the queue into containerd:main with commit eeb9065 Jun 8, 2025
52 checks passed

github-project-automation Bot moved this from Review In Progress to Done in Pull Request Review Jun 8, 2025

fuweid mentioned this pull request Jun 10, 2025

[release/2.1] Prepare release notes for v2.1.2 #11962

Merged

fuweid deleted the fix-11871 branch June 10, 2025 18:27

k8s-infra-cherrypick-robot mentioned this pull request Jun 10, 2025

[release/2.1] Properly shutdown non-groupable shims to prevent resource leaks #11971

Merged

austinvazquez added the cherry-picked/2.1.x label and removed the cherry-pick/2.1.x label Jun 10, 2025

fuweid mentioned this pull request Jun 11, 2025

[release/2.0] Fix incompatibility with some pre-v3 shims #11973

Merged

fuweid added the cherry-picked/2.0.x label and removed the cherry-pick/2.0.x label Jun 11, 2025

smira mentioned this pull request Jun 12, 2025

feat: update containerd to 2.1.2 siderolabs/pkgs#1264

Merged

fuweid merged 1 commit into containerd:main from fuweid:fix-11871
