Skip to content

core/mount: Fix doPrepareIDMappedOverlay cleanup function#12114

Merged
mxpv merged 4 commits intocontainerd:mainfrom
rata:rootfs-cleanup-fixes
Aug 20, 2025
Merged

core/mount: Fix doPrepareIDMappedOverlay cleanup function#12114
mxpv merged 4 commits intocontainerd:mainfrom
rata:rootfs-cleanup-fixes

Conversation

@rata
Copy link
Copy Markdown
Contributor

@rata rata commented Jul 17, 2025

This PR fixes:

  • an issue where we can panic the go runtime by running a function that is nil if we take some error paths (containerd daemon continues to run fine, though)
  • the cleanup function was also adding noise to the logs on error paths (printing we failed to delete a directory, although we never created it, etc.)
  • The cleanup function didn't retry unmounting and could end up leaking mounts (containerd leaks many <root>/tmpmounts/ovl-idmapped* mounts until restart #12139)

While we are there, I took the opportunity to add tests for the cleanup function and test these issues, as well as previous issues we encounter, are fixed.

Please see individual commits for more details. I think it's simpler to review by checking each commit too.

PR #5890 introduced the issues in the cleanup function.

Fixes: #12139

@k8s-ci-robot
Copy link
Copy Markdown

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rata rata force-pushed the rootfs-cleanup-fixes branch from 2e10ae6 to 937ac52 Compare July 17, 2025 14:45
@austinvazquez
Copy link
Copy Markdown
Member

/ok-to-test

@rata
Copy link
Copy Markdown
Contributor Author

rata commented Jul 17, 2025

All green except fedora with an unrelated failure. Pushing again to re-run the tests

@rata
Copy link
Copy Markdown
Contributor Author

rata commented Jul 17, 2025

@AkihiroSuda @fuweid can you take a look?

@rata rata force-pushed the rootfs-cleanup-fixes branch from 4e945b9 to e6705a2 Compare July 17, 2025 15:50
@fuweid fuweid self-requested a review July 17, 2025 18:13
Copy link
Copy Markdown
Contributor

@halaney halaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, I like the centralization of who makes/deletes the tmpdir.

Comment thread core/mount/mount_linux_test.go Outdated
Copy link
Copy Markdown
Contributor

@halaney halaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve after testing again, but similar patch in my tree has worked already so I think its good to go as well.

Comment thread core/mount/mount_linux.go Outdated
Copy link
Copy Markdown
Contributor

@halaney halaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll approve after testing again, but similar patch in my tree has worked already so I think its good to go as well.

Leaked mounts, be gone! :P When the umount fails due to EBUSY the second one succeeds for me.

@rata rata force-pushed the rootfs-cleanup-fixes branch 3 times, most recently from 8a280f0 to 09694d9 Compare August 11, 2025 19:52
@rata rata changed the title core/mount: Fix doPrepareIDMappedOverlay cleanup function core/mount: Fix doPrepareIDMappedOverlay cleanup function and retry temp idmap unmounts Aug 11, 2025
@rata rata force-pushed the rootfs-cleanup-fixes branch from 09694d9 to 456ec54 Compare August 11, 2025 20:46
@rata rata changed the title core/mount: Fix doPrepareIDMappedOverlay cleanup function and retry temp idmap unmounts core/mount: Fix doPrepareIDMappedOverlay cleanup function Aug 11, 2025
@rata rata force-pushed the rootfs-cleanup-fixes branch 3 times, most recently from 56f4467 to 4da6b42 Compare August 12, 2025 19:06
@rata rata force-pushed the rootfs-cleanup-fixes branch from ca86140 to 20826e5 Compare August 14, 2025 21:16
@rata
Copy link
Copy Markdown
Contributor Author

rata commented Aug 14, 2025

Tests are green! (failures were unrelated to this PR)

doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function mandates for the callback to be called, even when errors
are returned, we end-up calling the nil function that of course causes:

        panic: runtime error: invalid memory address or nil pointer dereference

The containerd daemon continues to run fine, though.

With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly) or return a "stub function" that doesn't do
anything on those cases (so the call works). None of these options seem
very nice.

Let's just switch the logic to a pattern more common on go: only need to
cleanup if it returned fine. This makes it easier for callers to not do
the wrong thing.

For this change we need to do the cleanup when returning errors on the
function itself, so the user doesn't need to do it for us.

Signed-off-by: Rodrigo Campos <[email protected]>
@rata
Copy link
Copy Markdown
Contributor Author

rata commented Aug 18, 2025

Rebased and all tests are green. @fuweid @AkihiroSuda can you please take a look?

Comment thread core/mount/mount_unix.go Outdated
cleanup()

_, err = os.Stat(remountsLocation)
err = os.Remove(remountsLocation)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add more information why we need to change this?

The original idea is to verify that remountsLocation should be deleted after called cleanup, if there is no failpoint.
But in this change, you manually remove it. we couldn't figure it out who removes it.

Copy link
Copy Markdown
Contributor Author

@rata rata Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is explained in the commit message that does that change :)

But it's because now that we move the os.MkdirTemp() from prepareIDMappedOverlay() to doPrepareIDMappedOverlay(). This makes it simpler to reason and not leak dirs (before it was created in prepareIDMappedOverlay() but removed by the cleanup of doPrepareIDMappedOverlay(), although not always as you couldn't always call that cleanup function. It was messy).

But now that everything is done by that function, we expect this directory to be empty in the "all is fine" case. Therefore, the os.Remove should work if it's empty. If we are in the "err returned" case, we expect the remove to return an error, as there should be directories inside it. This is what the if tc.injectUmountFault below is doing.

This is what I wrote in the commit msg, let me know if it's not clear:

This patch on one hand moves the creation of the tmpDir into the
function that will remove it later and renames the parameter. On the
other hand, it also separates the cleanup function in two functions:
cleanDir() and cleanMount(). Each one will clean the directory or the
mount, only if needed, and a new cleanup function that just calls
cleanDir() and cleanMount() is returned as the cleanup function handler.

Because we now create the tmp directory inside the function, we need to
adjust the test: the directory passed as param now will exist, but it
shoild be empty. Therefore, we change the os.Stat() to an os.Remove().
If it's not empty (the cleanup didn't work), the remove will fail.

Does it make sense?

Comment thread core/mount/mount_linux.go Outdated
Comment thread core/mount/mount_linux_test.go Outdated
{
name: "cleanup callback short failure",
callbackFailure: true,
callbackFailureDuration: unmountRetryDelay * 2, // Duration of 2x unmountRetryDelay means it will fail for about two retries.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on my experience, the CI env is not stable and the CI provider could be overcommitted to run multiple jobs in one node. Once that node is busy, it's easy to be flaky. I don't know how to handle it. may we can implement specific umountWithRetry for IdMapping (it could have more retry) so that we can verify that retry logic.

Copy link
Copy Markdown
Contributor Author

@rata rata Aug 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, great point. I created a channel that we only send in tests and only when opt-in for this "visibility".

Not sure the complexity is worth it, though, I think removing the tests might be simpler.

What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify, I removed those tests. Here is a diff in another branch that shows what I did: rata/containerd@rootfs-cleanup-fixes...rata:containerd:rootfs-cleanup-fixes-unmount-tests

I guess we can tackle the test in a follow-up PR once we agree on a strategy?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can handle it in a follow-up~ Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid how do you want to handle this? The diff in the link is a WIP (more things are needed), but it's not really trivial.
I've checked quickly also internal/failpoint, but it's not clear it will be useful for this. Do you know?

I'm not convinced that is worth writing a test, as it will be handled by the mount manager "soon", the wrapper today retries and it's quite a simple function, I don't think other retries are being tested and when you restart containerd any leaked mount is cleaned.

What do you think?

Comment thread core/mount/mount_linux.go
rata added 2 commits August 19, 2025 14:58
Before this patch, the cleanup was unconditionally trying an unmount,
which is not needed if the mount never succeeded in the first place.
This causes logs of failed stuff, that should never be there.
Also, one function was creating the tmpDir and the nested one removing
it (in the cleanup function), which complicates the reasoning about
not leaking resources.

This patch on one hand moves the creation of the tmpDir into the
function that will remove it later and renames the parameter. On the
other hand, it also separates the cleanup function in two functions:
cleanDir() and cleanMount(). Each one will clean the directory or the
mount, only if needed, and a new cleanup function that just calls
cleanDir() and cleanMount() is returned as the cleanup function handler.

Because we now create the tmp directory inside the function, we need to
adjust the test: the directory passed as param now will exist, but it
shoild be empty. Therefore, we change the os.Stat() to an os.Remove().
If it's not empty (the cleanup didn't work), the remove will fail.

Signed-off-by: Rodrigo Campos <[email protected]>
@rata rata force-pushed the rootfs-cleanup-fixes branch 5 times, most recently from 0275aec to b3ed66c Compare August 19, 2025 14:24
@rata rata requested a review from fuweid August 19, 2025 14:31
When a lot of pods are created all at the same time, the umount fails
with EBUSY sometimes. This causes some mounts to be leaked. A way to
repro this issue is here:
	containerd#12139 (comment)

While using lsof/fuser it was not possible to get the culprits on time
(the mount is busy for a few ms), @fuwei has found what is causing the
mount to be busy:
	containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay()
is ongoing and has an open fd to the path to idmap, then the unmount
callback will fail as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the times on my system and in the systems
of @halaney that reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched to non-detached again in containerd#10955 and we started to
leak mounts. As @fuwei mentioned, using a detached mount is quite
invisible for the admin and, therefore, a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

[1]: To do it, I just run as root "unshare -m", that creates a mntns
with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <[email protected]>
@rata rata force-pushed the rootfs-cleanup-fixes branch from b3ed66c to da3dc1e Compare August 19, 2025 15:38
Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-project-automation github-project-automation Bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 20, 2025
@mxpv mxpv added this pull request to the merge queue Aug 20, 2025
Merged via the queue into containerd:main with commit 3df1782 Aug 20, 2025
93 of 96 checks passed
@github-project-automation github-project-automation Bot moved this from Review In Progress to Done in Pull Request Review Aug 20, 2025
@rata rata deleted the rootfs-cleanup-fixes branch August 20, 2025 09:56
@fuweid fuweid added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch labels Aug 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/runtime Runtime cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch kind/bug ok-to-test size/L

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

containerd leaks many <root>/tmpmounts/ovl-idmapped* mounts until restart

7 participants