Only idmap once per overlayfs, not per layer by halaney · Pull Request #12092 · containerd/containerd

halaney · 2025-07-14T20:25:02Z

Per layer idmap'ed bind mounts are costly to performance, as shown in
[0]. Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, it's a constant idmap per overlayfs!
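
To make the "common directory of all the layers" idea concrete, here is a minimal sketch of how such a directory could be computed from a list of layer paths. It is not the code from this PR: the helper name `commonParentDir` and the example paths are hypothetical, and the PR's own helper (`getCommonDirectory()`) may behave differently.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// commonParentDir is a hypothetical helper (not containerd's implementation).
// It returns the deepest directory that is an ancestor of, or equal to,
// every absolute path in dirs.
func commonParentDir(dirs []string) (string, error) {
	if len(dirs) == 0 {
		return "", errors.New("no directories given")
	}
	sep := string(filepath.Separator)
	var common []string
	for i, d := range dirs {
		if !filepath.IsAbs(d) {
			return "", fmt.Errorf("path %q is not absolute", d)
		}
		parts := strings.Split(filepath.Clean(d), sep)
		if i == 0 {
			common = parts
			continue
		}
		// Shrink the shared prefix to the components both paths agree on.
		n := 0
		for n < len(common) && n < len(parts) && common[n] == parts[n] {
			n++
		}
		common = common[:n]
	}
	// common[0] is the empty string before the leading separator.
	return sep + filepath.Join(common[1:]...), nil
}

func main() {
	// Illustrative layer paths only; real snapshot paths will differ.
	layers := []string{"/snapshots/1/fs", "/snapshots/2/fs", "/snapshots/3/fs"}
	dir, err := commonParentDir(layers)
	if err != nil {
		panic(err)
	}
	fmt.Println(dir) // prints "/snapshots"
}
```

With one idmapped bind mount of that directory, each layer can then be addressed as a path relative to it. A real implementation would probably also want to reject an overly broad result such as `/`.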

k8s-ci-robot · 2025-07-14T20:25:12Z

Hi @halaney. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

halaney · 2025-07-14T21:19:30Z

@rata here's the first stab at this, lemme know what you think!

AkihiroSuda · 2025-07-15T06:11:42Z

/ok-to-test

rata

@halaney awesome, thanks a lot! I left some comments, I'll try the code later :)

rata · 2025-07-15T10:47:05Z

+	commondir, err := getCommonDirectory(dirs)
+	if err != nil {


This seems like an unrelated change to the patchset. I'd leave this out, for a different PR. This way, we can backport this if needed to other containerd releases, without unnecessary changes.

I'll flip the commit order / logic to be:

1. idmap once per overlayfs (and add getCommonDirectory() in there)
2. Use getCommonDirectory() in compactLowerdirOption()

That way 1 can be cherry-picked more independently and we still get the reuse of that function. Both bits of code are sort of doing the same thing, so it feels a bit wrong to dupe the functionality and not reuse.

The automated machinery backports whole PRs, so I'd do that in another PR.

Ah, didn't know that! Thanks, I've dropped that bit out for now.

rata

@halaney awesome, this looks great! I left two nits, after those fixes I'll approve :-D

rata · 2025-07-16T10:25:00Z

Also, please try to push again if CI is failing but the error seems unrelated to this PR. We can't merge until CI is green

rata · 2025-07-16T10:26:39Z

@halaney just curious, can you stress test this to see if systemd or the system in general is affected? Now there is one extra mount per rootfs, that is visible to systemd, so it might cause some extra load. It would be great if you can verify the behaviour with this PR too :)

rata · 2025-07-16T14:16:03Z

I've also validated this locally, it works fine. @halaney great work, let's push it to the finish line :)

halaney · 2025-07-16T15:16:18Z

@rata -- sorry, I was trying to gather data before addressing the latest comments! For the data below I'm back on an older, 6.5-based kernel (compared to the prior data; noting this mostly for my own records).

Here's a flamegraph running 100 pods without this patch applied, each pod with 55 layers. Note how much time is spent in doPrepareIDMappedOverlay(). Also note that compared to #12048's data (which was on 2.0.5) things are actually a bit better, but it's still not good. I think the difference from 2.0.5 is that the bind mounts here are now read-only, which skips some kernel locks.

Here's a flamegraph of the same test, with this patch applied. You can't even see doPrepareIDMappedOverlay() anymore in the view. I've highlighted in purple the Mount() call (I would have highlighted doPrepare... but it's so small you still can't notice it).

I've scaled the test up to run more pods (which dies pretty spectacularly without this patch and goes through fine with it).

There are other things that don't scale great, like the pinned user namespace not getting cleaned up until kubelet removes the sandbox. Having these stay around is a bit troublesome if you churn through pods super quickly, but that's an entirely different subject to look at and nowhere near as impactful as this one:

$ mount
(...)
nsfs on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/dd4ced4b0d1f13735203f3d338b42aa6649fd65f0ea88ff144170bb91a581e3a/pinned-namespaces/user type nsfs (rw)
nsfs on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/ace057993e94944dc7d9422a5524c045ede056ac154362ba5113826ac49d66a3/pinned-namespaces/user type nsfs (rw)
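
For anyone who wants to track how many of these pinned user namespace mounts are lingering, a small sketch like the following could count them from `mount`-style output. The function name `countPinnedUserNS` is hypothetical, and the parsing assumes the `<source> on <target> type <fstype> (<options>)` line shape shown above with no whitespace in the target path.

```go
package main

import (
	"fmt"
	"strings"
)

// countPinnedUserNS is a hypothetical helper: it counts nsfs entries in
// `mount`-style output whose mount point ends in "/pinned-namespaces/user".
func countPinnedUserNS(mountOutput string) int {
	count := 0
	for _, line := range strings.Split(mountOutput, "\n") {
		// Expected shape: "<source> on <target> type <fstype> (<options>)".
		fields := strings.Fields(line)
		if len(fields) < 5 || fields[1] != "on" || fields[3] != "type" {
			continue
		}
		if fields[4] == "nsfs" && strings.HasSuffix(fields[2], "/pinned-namespaces/user") {
			count++
		}
	}
	return count
}

func main() {
	// Made-up sample lines in the same shape as the output above.
	sample := `overlay on /run/example/rootfs type overlay (rw)
nsfs on /run/example/sandboxes/abc/pinned-namespaces/user type nsfs (rw)
nsfs on /run/example/sandboxes/def/pinned-namespaces/user type nsfs (rw)`
	fmt.Println(countPinnedUserNS(sample)) // prints 2
}
```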

halaney · 2025-07-16T15:33:31Z

> @halaney just curious, can you stress test this to see if systemd or the system in general is affected? Now there is one extra mount per rootfs, that is visible to systemd, so it might cause some extra load. It would be great if you can verify the behaviour with this PR too :)

Just to be clear, the above test should have validated general system "ok-ness" :D

Also, the main reason I am quoting this -- the "extra mount per rootfs" is sort of misleading to anyone looking just at this PR. Previously, we did a temporary mount per layer. Now it's just one temporary mount per overlayfs, which can be a massive improvement if your images have a good number of layers. In both cases those temporary mounts are cleaned up shortly after the overlayfs is mounted.

What @rata is referring to when claiming there's an extra mount is the approach mentioned over here that we were originally exploring to improve this situation, which avoids mounting the idmap'ed lowerdirs at all and instead supplies them to the kernel as an fd. That requires a fairly new kernel, etc, and still would benefit from the work done here to reduce the number of operations from a function of # layers to a constant per overlayfs.

If anyone has recs for improving the commit message to better capture this + add some data, I'm all ears, please nitpick away!

rata · 2025-07-16T16:05:17Z

Perfect, so with this patch the systemd perf issue is gone and no kubelet timeouts from #12048! Great! :)

halaney · 2025-07-16T18:31:07Z

Hmm, CI is still not happy. I don't think that's my fault, but lemme dig a little more and, if it isn't, see if I can rekick those.

EDIT: Couldn't find a way to rekick, just going to rebase and rerun all of CI.

rata

LGTM. Thanks for the PR! Great improvement, the code is clean and simple and the tests are meaningful :)

Besides CI being green, I have validated locally that all seems good, using images with one layer (debian) and images with several layers (python).

This also closes #12048, I'd like to backport it to 2.0 and 2.1

@AkihiroSuda @fuweid can you please take a look?

rata · 2025-07-22T14:28:43Z

@AkihiroSuda friendly ping?

rata · 2025-07-23T12:54:04Z

@fuweid friendly ping? It would be great to merge this so we can backport to 2.0 and 2.1 too

fuweid · 2025-07-24T15:40:58Z

It looks good actually. I just want to test it locally before voting. Thanks

mbaynton · 2025-07-25T15:00:12Z

Neat idea.

halaney · 2025-08-04T16:04:50Z

@fuweid -- sorry to pester, but any ideas on when you'll get to try this out?

fuweid · 2025-08-07T03:23:43Z

Sorry @halaney for the late reply.
I've been kind of busy recently. Will test it tomorrow and give you feedback. 🙏

fuweid

LGTM

Let's bake it in main branch first and then apply it into supported releases.
Thanks!

rata

This LGTM, left a minor nit.

Per layer idmap'ed bind mounts are costly to performance, as shown in [0]. Each one requires taking various kernel locks, and each one shows up in the host's mount table, leading to some components like systemd processing all these temporary mounts unnecessarily.

Let's instead go ahead and idmap the common directory of all the layers to achieve the same effect. Now instead of being a function of the number of layers, it's a constant idmap per overlayfs!

This can have a big impact. For example, imagine running 100 containers at once, each with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to just 100. In reality both the shim and containerd proper do this, so it's actually double that!

[0]: containerd#12048 (comment)

Signed-off-by: Andrew Halaney <[email protected]>
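
The arithmetic in the commit message can be checked with a few lines of Go. This is only a back-of-the-envelope sketch: the function name `idmapMounts` is made up, and the `callers` factor of 2 models the statement that both the shim and containerd proper perform the mounts.

```go
package main

import "fmt"

// idmapMounts returns the number of idmapped bind mounts needed under the
// old per-layer scheme and under the new one-per-overlayfs scheme.
// callers is how many components repeat the work (shim + containerd = 2).
func idmapMounts(containers, layers, callers int) (perLayer, perOverlay int) {
	return containers * layers * callers, containers * callers
}

func main() {
	perLayer, perOverlay := idmapMounts(100, 50, 2)
	fmt.Printf("per-layer: %d, per-overlayfs: %d\n", perLayer, perOverlay)
	// prints "per-layer: 10000, per-overlayfs: 200"
}
```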

halaney · 2025-08-15T15:43:13Z

@AkihiroSuda @dmcgowan -- when you get some time can you review this? Would love to get this and #12114 merged

rata · 2025-08-20T13:14:49Z

Can we backport this to 2.0 and 2.1?

halaney · 2025-08-20T15:33:31Z

@rata thanks for the nudge, I'll prepare a backport with:

The RO idmap changes
This change
Your clean up changes

That should address all the problems those PRs fixed. It seems to be an acceptable procedure based on the backporting docs here:

containerd/RELEASES.md

Line 169 in 18e9318

### Backporting

halaney · 2025-08-20T22:13:05Z

@rata @fuweid backports over here:
2.0: #12223
2.1: #12222

rata · 2025-08-21T10:51:36Z

The bot can backport PRs. But a manual PR SGTM too :)

halaney · 2025-08-21T15:55:58Z

> The bot can backport PRs. But a manual PR SGTM too :)

Too many (2-3, 2.0 only needs one commit from one of the PRs) separate but related PRs to try and untangle. I'm a simple man, git-foo is easier than bot foo!

github-project-automation Bot added this to Pull Request Review Jul 14, 2025

k8s-ci-robot added the do-not-merge/work-in-progress label Jul 14, 2025

github-project-automation Bot moved this to Needs Triage in Pull Request Review Jul 14, 2025

k8s-ci-robot added needs-ok-to-test size/L labels Jul 14, 2025

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch 3 times, most recently from 0699c6d to 238cbb7 Compare July 14, 2025 20:35

halaney marked this pull request as ready for review July 14, 2025 21:19

k8s-ci-robot removed the do-not-merge/work-in-progress label Jul 14, 2025

dosubot Bot added the area/runtime Runtime label Jul 14, 2025

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jul 15, 2025

rata reviewed Jul 15, 2025

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 238cbb7 to 14a44d0 Compare July 15, 2025 18:35

rata reviewed Jul 16, 2025

Comment thread core/mount/mount_linux.go

Comment thread core/mount/mount_linux.go

Comment thread core/mount/mount_linux.go Outdated

Comment thread core/mount/mount_linux.go Outdated

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 14a44d0 to cc5d39e Compare July 16, 2025 16:03

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from cc5d39e to e6dc743 Compare July 16, 2025 16:06

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from e6dc743 to 18591b4 Compare July 16, 2025 19:07

rata approved these changes Jul 17, 2025

Comment thread core/mount/mount_linux_test.go

rata mentioned this pull request Jul 25, 2025

containerd leaks many <root>/tmpmounts/ovl-idmapped* mounts until restart #12139

Closed

fuweid approved these changes Aug 8, 2025

fuweid requested review from AkihiroSuda and dmcgowan August 8, 2025 02:27

rata approved these changes Aug 11, 2025

Comment thread core/mount/mount_linux.go Outdated

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 6a231b9 to ddb7228 Compare August 12, 2025 19:55

halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from ddb7228 to 6e9b6ea Compare August 12, 2025 21:28

AkihiroSuda approved these changes Aug 17, 2025

github-project-automation Bot moved this from Needs Reviewers to Review In Progress in Pull Request Review Aug 17, 2025

AkihiroSuda added this pull request to the merge queue Aug 17, 2025

Merged via the queue into containerd:main with commit 0649e9b Aug 17, 2025
50 checks passed

github-project-automation Bot moved this from Review In Progress to Done in Pull Request Review Aug 17, 2025

halaney mentioned this pull request Aug 19, 2025

starting many user namespace enabled pods at once causes bad mount performance #12048

Closed

fuweid added the cherry-picked/2.0.x (PR commits are cherry picked into the release/2.0 branch) and cherry-picked/2.1.x (PR commits are cherry picked into the release/2.1 branch) labels Aug 26, 2025

fuweid mentioned this pull request Nov 5, 2025

Prepare release notes for v2.2.0 #12473

Merged
