Skip to content

Conversation

@halaney
Copy link
Contributor

@halaney halaney commented Aug 20, 2025

Backport various overlayfs related user namespace fixes and improvements

This backports various fixes related to the following issues:

Issue #12048
Issue #12139

These patches make:

1. Makes far less bind mounts happen per overlayfs, greatly improving
   performance which was shown to be devastating enough to kill the node
2. Retries on umount errors (a common containerd pattern) for the temporary
   bind mounts to prevent leaking mounts on long running containerd daemons

These are great improvements to the stability of containerd when using
user namespaces + overlayfs.

halaney and others added 5 commits August 20, 2025 16:39
Per layer idmap'ed bind mounts are costly to performance, as shown in
[0]. Each one requires taking various kernel locks, and each one shows
up in the host's mount table leading to some components like systemd
processing all these temporary mounts unnecessarily.

Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, its a constant idmap per overlayfs! This can have
a big impact. For example, imagine running 100 containers at once, each
with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to
just 100. In reality both the shim and containerd proper do this, so its
actually double that!

[0]: containerd#12048 (comment)
Signed-off-by: Andrew Halaney <[email protected]>
(cherry picked from commit 6e9b6ea)
Signed-off-by: Andrew Halaney <[email protected]>
doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function mandates for the callback to be called, even when errors
are returned, we end-up calling the nil function that of course causes:

        panic: runtime error: invalid memory address or nil pointer dereference

The containerd daemon continues to run fine, though.

With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly) or return a "stub function" that doesn't do
anything on those cases (so the call works). None of these options seem
very nice.

Let's just switch the logic to a pattern more common on go: only need to
cleanup if it returned fine. This makes it easier for callers to not do
the wrong thing.

For this change we need to do the cleanup when returning errors on the
function itself, so the user doesn't need to do it for us.

Signed-off-by: Rodrigo Campos <[email protected]>
(cherry picked from commit 7a19c94)
Signed-off-by: Andrew Halaney <[email protected]>
Before this patch, the cleanup was unconditionally trying an unmount,
which is not needed if the mount never succeeded in the first place.
This causes logs of failed stuff, that should never be there.
Also, one function was creating the tmpDir and the nested one removing
it (in the cleanup function), which complicates the reasoning about
not leaking resources.

This patch on one hand moves the creation of the tmpDir into the
function that will remove it later and renames the parameter. On the
other hand, it also separates the cleanup function in two functions:
cleanDir() and cleanMount(). Each one will clean the directory or the
mount, only if needed, and a new cleanup function that just calls
cleanDir() and cleanMount() is returned as the cleanup function handler.

Because we now create the tmp directory inside the function, we need to
adjust the test: the directory passed as param now will exist, but it
shoild be empty. Therefore, we change the os.Stat() to an os.Remove().
If it's not empty (the cleanup didn't work), the remove will fail.

Signed-off-by: Rodrigo Campos <[email protected]>
(cherry picked from commit dd7fe0b)
Signed-off-by: Andrew Halaney <[email protected]>
Signed-off-by: Rodrigo Campos <[email protected]>
(cherry picked from commit 27ba690)
Signed-off-by: Andrew Halaney <[email protected]>
When a lot of pods are created all at the same time, the umount fails
with EBUSY sometimes. This causes some mounts to be leaked. A way to
repro this issue is here:
	containerd#12139 (comment)

While using lsof/fuser it was not possible to get the culprits on time
(the mount is busy for a few ms), @fuwei has found what is causing the
mount to be busy:
	containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay()
is ongoing and has an open fd to the path to idmap, then the unmount
callback will fail as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the times on my system and in the systems
of @halaney that reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched to non-detached again in containerd#10955 and we started to
leak mounts. As @fuwei mentioned, using a detached mount is quite
invisible for the admin and, therefore, a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

[1]: To do it, I just run as root "unshare -m", that creates a mntns
with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <[email protected]>
(cherry picked from commit da3dc1e)
Signed-off-by: Andrew Halaney <[email protected]>
@k8s-ci-robot
Copy link

Hi @halaney. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@halaney halaney changed the title Ahalaney/overlayfs idmap backports 2.1 Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025
@halaney halaney marked this pull request as ready for review August 20, 2025 22:10
@dosubot dosubot bot added the area/runtime Runtime label Aug 20, 2025
@AkihiroSuda AkihiroSuda changed the title Backport various overlayfs related user namespace fixes and improvements [release/2.1] Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025
Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-project-automation github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 22, 2025
@mxpv mxpv merged commit c8c4575 into containerd:release/2.1 Aug 22, 2025
93 of 94 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 22, 2025
@austinvazquez austinvazquez changed the title [release/2.1] Backport various overlayfs related user namespace fixes and improvements [release/2.1] Fix overlayfs issues related to user namespace Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants