[release/2.1] Fix overlayfs issues related to user namespace #12222

halaney · 2025-08-20T21:40:53Z

Backport various overlayfs related user namespace fixes and improvements

This backports various fixes related to the following issues:

Issue #12048
Issue #12139

These patches make:

1. Makes far less bind mounts happen per overlayfs, greatly improving
   performance which was shown to be devastating enough to kill the node
2. Retries on umount errors (a common containerd pattern) for the temporary
   bind mounts to prevent leaking mounts on long running containerd daemons

These are great improvements to the stability of containerd when using
user namespaces + overlayfs.

Per layer idmap'ed bind mounts are costly to performance, as shown in [0]. Each one requires taking various kernel locks, and each one shows up in the host's mount table leading to some components like systemd processing all these temporary mounts unnecessarily. Let's instead go ahead and idmap the common directory of all the layers to achieve the same effect. Now instead of being a function of the number of layers, its a constant idmap per overlayfs! This can have a big impact. For example, imagine running 100 containers at once, each with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to just 100. In reality both the shim and containerd proper do this, so its actually double that! [0]: containerd#12048 (comment) Signed-off-by: Andrew Halaney <[email protected]> (cherry picked from commit 6e9b6ea) Signed-off-by: Andrew Halaney <[email protected]>

doPrepareIDMappedOverlay() can return nil as the cleanup function. As the function mandates for the callback to be called, even when errors are returned, we end-up calling the nil function that of course causes: panic: runtime error: invalid memory address or nil pointer dereference The containerd daemon continues to run fine, though. With the current structure of always calling the cleanup function, we would either need to check which error it is, to call it only on specific errors (ugly) or return a "stub function" that doesn't do anything on those cases (so the call works). None of these options seem very nice. Let's just switch the logic to a pattern more common on go: only need to cleanup if it returned fine. This makes it easier for callers to not do the wrong thing. For this change we need to do the cleanup when returning errors on the function itself, so the user doesn't need to do it for us. Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit 7a19c94) Signed-off-by: Andrew Halaney <[email protected]>

Before this patch, the cleanup was unconditionally trying an unmount, which is not needed if the mount never succeeded in the first place. This causes logs of failed stuff, that should never be there. Also, one function was creating the tmpDir and the nested one removing it (in the cleanup function), which complicates the reasoning about not leaking resources. This patch on one hand moves the creation of the tmpDir into the function that will remove it later and renames the parameter. On the other hand, it also separates the cleanup function in two functions: cleanDir() and cleanMount(). Each one will clean the directory or the mount, only if needed, and a new cleanup function that just calls cleanDir() and cleanMount() is returned as the cleanup function handler. Because we now create the tmp directory inside the function, we need to adjust the test: the directory passed as param now will exist, but it shoild be empty. Therefore, we change the os.Stat() to an os.Remove(). If it's not empty (the cleanup didn't work), the remove will fail. Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit dd7fe0b) Signed-off-by: Andrew Halaney <[email protected]>

Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit 27ba690) Signed-off-by: Andrew Halaney <[email protected]>

@fuwei

When a lot of pods are created all at the same time, the umount fails with EBUSY sometimes. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to get the culprits on time (the mount is busy for a few ms), @fuwei has found what is causing the mount to be busy: containerd#12139 (comment) When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail as the forked process still has an open fd to it. Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, that retries the unmount. This retry fixes it 100% of the times on my system and in the systems of @halaney that reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched to non-detached again in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible for the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. [1]: To do it, I just run as root "unshare -m", that creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit da3dc1e) Signed-off-by: Andrew Halaney <[email protected]>

k8s-ci-robot · 2025-08-20T21:41:03Z

Hi @halaney. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

rata

LGTM

halaney and others added 5 commits August 20, 2025 16:39

core/mount: Test cleanup of DoPrepareIDMappedOverlay()

1f51d2d

Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit 27ba690) Signed-off-by: Andrew Halaney <[email protected]>

github-project-automation bot added this to Pull Request Review Aug 20, 2025

k8s-ci-robot added the do-not-merge/work-in-progress label Aug 20, 2025

github-project-automation bot moved this to Needs Triage in Pull Request Review Aug 20, 2025

k8s-ci-robot added size/L needs-ok-to-test labels Aug 20, 2025

halaney changed the title ~~Ahalaney/overlayfs idmap backports 2.1~~ Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025

halaney marked this pull request as ready for review August 20, 2025 22:10

k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 20, 2025

dosubot bot added the area/runtime Runtime label Aug 20, 2025

This was referenced Aug 20, 2025

Only idmap once per overlayfs, not per layer #12092

Merged

containerd leaks many <root>/tmpmounts/ovl-idmapped* mounts until restart #12139

Closed

AkihiroSuda changed the title ~~Backport various overlayfs related user namespace fixes and improvements~~ [release/2.1] Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025

rata approved these changes Aug 21, 2025

View reviewed changes

estesp approved these changes Aug 21, 2025

View reviewed changes

mxpv approved these changes Aug 22, 2025

View reviewed changes

github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 22, 2025

mxpv merged commit c8c4575 into containerd:release/2.1 Aug 22, 2025
93 of 94 checks passed

github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 22, 2025

austinvazquez mentioned this pull request Nov 5, 2025

[release/2.1] Prepare release notes for v2.1.5 #12483

Merged

austinvazquez changed the title ~~[release/2.1] Backport various overlayfs related user namespace fixes and improvements~~ [release/2.1] Fix overlayfs issues related to user namespace Nov 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[release/2.1] Fix overlayfs issues related to user namespace #12222

[release/2.1] Fix overlayfs issues related to user namespace #12222

Uh oh!

halaney commented Aug 20, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Aug 20, 2025

Uh oh!

rata left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[release/2.1] Fix overlayfs issues related to user namespace #12222

[release/2.1] Fix overlayfs issues related to user namespace #12222

Uh oh!

Conversation

halaney commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Aug 20, 2025

Uh oh!

rata left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

halaney commented Aug 20, 2025 •

edited

Loading