Skip to content

Only idmap once per overlayfs, not per layer#12092

Merged
AkihiroSuda merged 1 commit intocontainerd:mainfrom
halaney:ahalaney/use-one-idmap-per-overlayfs
Aug 17, 2025
Merged

Only idmap once per overlayfs, not per layer#12092
AkihiroSuda merged 1 commit intocontainerd:mainfrom
halaney:ahalaney/use-one-idmap-per-overlayfs

Conversation

@halaney
Copy link
Copy Markdown
Contributor

@halaney halaney commented Jul 14, 2025

Per layer idmap'ed bind mounts are costly to performance, as shown in
0. Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, its a constant idmap per overlayfs!

@k8s-ci-robot
Copy link
Copy Markdown

Hi @halaney. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch 3 times, most recently from 0699c6d to 238cbb7 Compare July 14, 2025 20:35
@halaney halaney marked this pull request as ready for review July 14, 2025 21:19
@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Jul 14, 2025

@rata here's the first stab at this, lemme know what you think!

@dosubot dosubot Bot added the area/runtime Runtime label Jul 14, 2025
@AkihiroSuda
Copy link
Copy Markdown
Member

/ok-to-test

Copy link
Copy Markdown
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@halaney awesome, thanks a lot! I left some commets, I'll try the code later :)

Comment thread core/mount/mount_linux.go Outdated
Comment thread core/mount/mount_linux.go Outdated
Comment thread core/mount/mount_linux.go Outdated
Comment on lines +398 to +399
commondir, err := getCommonDirectory(dirs)
if err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an unrelated change to the patchset. I'd leave this out, for a different PR. This way, we can backport this if needed to other containerd releases, without unnecesary changes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll flip the commit order / logic to be:

  1. idmap once per overlayfs (and add getCommonDirectory() in there)
  2. Use getCommonDirectory() in compactLowerdirOption()

That way 1 can be cherry-picked more independently and we still get the reuse of that function. Both bits of code are sort of doing the same thing, so it feels a bit wrong to dupe the functionality and not reuse.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The automated machinery backports PR, so I'd do that in another PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't know that! Thanks, I've dropped that bit out for now.

Comment thread core/mount/mount_linux.go
Comment thread core/mount/mount_linux.go
@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 238cbb7 to 14a44d0 Compare July 15, 2025 18:35
Copy link
Copy Markdown
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@halaney awesome, this looks great! I left two nits, after those fixes I'll approve :-D

Comment thread core/mount/mount_linux.go
Comment thread core/mount/mount_linux.go
Comment thread core/mount/mount_linux.go Outdated
Comment thread core/mount/mount_linux.go Outdated
@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 16, 2025

Also, please try to push again if CI is failing but the error seems unrelated to this PR. We can't merge until CI is green

@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 16, 2025

@halaney just curious, can you stress test this to see if systemd or the system in general is affected? Now there is one extra mount per rootfs, that is visible to systemd, so it might cause some extra load. It would be great if you can verify the behaviour with this PR too :)

@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 16, 2025

I've also validated this locally, it works fine. @halaney great work, let's push it to the finish line :)

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Jul 16, 2025

@rata -- sorry was trying to gather data before addressing latest comments! The data below I'm back on an older kernel (compared to prior data, mostly just for my notes) -- 6.5 based.

Here's a flamegraph running 100 pods without this patch applied, each pods 55 layers. Note how much time is spent in doPrepareIDMappedOverlay(). Also note that compared to #12048 's data (which was on 2.0.5) things are actually a bit better, but its still not good. I think the difference with 2.0.5 is that the bind mounts here are now read only which skips some kernel locks.
image

Here's a flamegraph of the same test, with this patch applied. You can't even see doPrepareIDMappedOverlay() anymore in the view. I've highlighted in purple the Mount() call (would have done doPrepare... but its so small you still can't notice it).
image

I've scaled the test up to run more pods (which dies pretty spectacularly without this patch and goes through fine with it).

There's other things that don't scale great like the pinned user namespace not getting cleaned up until kubelet removes the sandbox, but that's a different subject and is nowhere near as impactful as this is. i.e. these staying around is a bit troublesome if you churn thru pods super quickly, but like I said that's an entirely different subject to look at and not nearly as impactful:

$ mount
(...)
nsfs on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/dd4ced4b0d1f13735203f3d338b42aa6649fd65f0ea88ff144170bb91a581e3a/pinned-namespaces/user type nsfs (rw)
nsfs on /run/containerd/io.containerd.grpc.v1.cri/sandboxes/ace057993e94944dc7d9422a5524c045ede056ac154362ba5113826ac49d66a3/pinned-namespaces/user type nsfs (rw)

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Jul 16, 2025

@halaney just curious, can you stress test this to see if systemd or the system in general is affected? Now there is one extra mount per rootfs, that is visible to systemd, so it might cause some extra load. It would be great if you can verify the behaviour with this PR too :)

Just to be clear, the above test should have validated general system "ok-ness" :D

Also, the main reason I am quoting this -- the "extra mount per rootfs" is sort of misleading to anyone looking just at this PR. Prior, we did a temporary mount per layer. Now its just one temporary mount per overlayfs, which can be a massive improvement if your images have a good number of layers. In both cases those temporary mounts are cleaned up shortly after the overlayfs is mounted.

What @rata is referring to when claiming there's an extra mount is the approach mentioned over here that we were originally exploring to improve this situation, which avoids mounting the idmap'ed lowerdirs at all and instead supplies them to the kernel as an fd. That requires a fairly new kernel, etc, and still would benefit from the work done here to reduce the number of operations from a function of # layers to a constant per overlayfs.

If anyone has recs for improving the commit message to better captures this + add some data, I'm all ears, please nitpick away!

@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 14a44d0 to cc5d39e Compare July 16, 2025 16:03
@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 16, 2025

Perfect, so with this patch the systemd perf issue is gone and no kubelet timeouts from #12048! Great! :)

@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from cc5d39e to e6dc743 Compare July 16, 2025 16:06
@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Jul 16, 2025

Hmm, CI is still not happy. I don't think that's my fault, but lemme dig a little more and if not see if i can rekick those.

EDIT: Couldn't find a way to rekick, just going to rebase and rerun all of CI.

@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from e6dc743 to 18591b4 Compare July 16, 2025 19:07
Copy link
Copy Markdown
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the PR! Great improvement, the code is clean and simple and the tests are meaningful :)

Besides CI being green, I have validated locally that all seems good, using images with one layer (debian) and images with several layers (python).

This also closes #12048, I'd like to backport it to 2.0 and 2.1

@AkihiroSuda @fuweid can you please take a look?

Comment thread core/mount/mount_linux_test.go
@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 22, 2025

@AkihiroSuda friendly ping?

@rata
Copy link
Copy Markdown
Contributor

rata commented Jul 23, 2025

@fuweid friendly ping? It would be great to merge this so we can backport to 2.0 and 2.1 too

@fuweid
Copy link
Copy Markdown
Member

fuweid commented Jul 24, 2025

It looks good actually. Just want to test it in my local before voting. Thanks

@mbaynton
Copy link
Copy Markdown
Contributor

Neat idea.

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Aug 4, 2025

@fuweid -- sorry to pester, but any ideas on when you'll get to try this out?

@fuweid
Copy link
Copy Markdown
Member

fuweid commented Aug 7, 2025

Sorry @halaney for late reply.
kind of busy recently. Will test it tomorrow and give feedback to you. 🙏

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Let's bake it in main branch first and then apply it into supported releases.
Thanks!

@fuweid fuweid requested review from AkihiroSuda and dmcgowan August 8, 2025 02:27
Copy link
Copy Markdown
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tihs LGTM, left a minor neat.

Comment thread core/mount/mount_linux.go Outdated
@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from 6a231b9 to ddb7228 Compare August 12, 2025 19:55
Per layer idmap'ed bind mounts are costly to performance, as shown in
[0]. Each one requires taking various kernel locks, and each one shows
up in the host's mount table leading to some components like systemd
processing all these temporary mounts unnecessarily.

Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, its a constant idmap per overlayfs! This can have
a big impact. For example, imagine running 100 containers at once, each
with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to
just 100. In reality both the shim and containerd proper do this, so its
actually double that!

[0]: containerd#12048 (comment)
Signed-off-by: Andrew Halaney <[email protected]>
@halaney halaney force-pushed the ahalaney/use-one-idmap-per-overlayfs branch from ddb7228 to 6e9b6ea Compare August 12, 2025 21:28
@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Aug 15, 2025

@AkihiroSuda @dmcgowan -- when you get some time can you review this? Would love to get this and #12114 merged

@github-project-automation github-project-automation Bot moved this from Needs Reviewers to Review In Progress in Pull Request Review Aug 17, 2025
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Aug 17, 2025
Merged via the queue into containerd:main with commit 0649e9b Aug 17, 2025
50 checks passed
@github-project-automation github-project-automation Bot moved this from Review In Progress to Done in Pull Request Review Aug 17, 2025
@rata
Copy link
Copy Markdown
Contributor

rata commented Aug 20, 2025

Can we backport this to 2.0 and 2.1?

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Aug 20, 2025

@rata thanks for the nudge, I'll prepare a backport with:

  1. The RO idmap changes
  2. This change
  3. Your clean up changes

To address all the problems those PRs addressed. Seems that acceptable procedure based on backporting docs here

### Backporting

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Aug 20, 2025

@rata @fuweid backports over here:
2.0: #12223
2.1: #12222

@rata
Copy link
Copy Markdown
Contributor

rata commented Aug 21, 2025

The bot can backport PRs. But a manual PR SGTM too :)

@halaney
Copy link
Copy Markdown
Contributor Author

halaney commented Aug 21, 2025

The bot can backport PRs. But a manual PR SGTM too :)

Too many (2-3, 2.0 only needs one commit from one of the PRs) separate but related PRs to try and untangle. I'm a simple man, git-foo is easier than bot foo!

@fuweid fuweid added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch labels Aug 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/runtime Runtime cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch ok-to-test size/L

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

7 participants