Only idmap once per overlayfs, not per layer#12092
Only idmap once per overlayfs, not per layer#12092AkihiroSuda merged 1 commit intocontainerd:mainfrom
Conversation
|
Hi @halaney. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
0699c6d to
238cbb7
Compare
|
@rata here's the first stab at this, lemme know what you think! |
|
/ok-to-test |
| commondir, err := getCommonDirectory(dirs) | ||
| if err != nil { |
There was a problem hiding this comment.
This seems like an unrelated change to the patchset. I'd leave this out, for a different PR. This way, we can backport this if needed to other containerd releases, without unnecesary changes
There was a problem hiding this comment.
I'll flip the commit order / logic to be:
- idmap once per overlayfs (and add getCommonDirectory() in there)
- Use getCommonDirectory() in compactLowerdirOption()
That way 1 can be cherry-picked more independently and we still get the reuse of that function. Both bits of code are sort of doing the same thing, so it feels a bit wrong to dupe the functionality and not reuse.
There was a problem hiding this comment.
The automated machinery backports PR, so I'd do that in another PR
There was a problem hiding this comment.
Ah, didn't know that! Thanks, I've dropped that bit out for now.
238cbb7 to
14a44d0
Compare
|
Also, please try to push again if CI is failing but the error seems unrelated to this PR. We can't merge until CI is green |
|
@halaney just curious, can you stress test this to see if systemd or the system in general is affected? Now there is one extra mount per rootfs, that is visible to systemd, so it might cause some extra load. It would be great if you can verify the behaviour with this PR too :) |
|
I've also validated this locally, it works fine. @halaney great work, let's push it to the finish line :) |
|
@rata -- sorry was trying to gather data before addressing latest comments! The data below I'm back on an older kernel (compared to prior data, mostly just for my notes) -- 6.5 based. Here's a flamegraph running 100 pods without this patch applied, each pods 55 layers. Note how much time is spent in Here's a flamegraph of the same test, with this patch applied. You can't even see I've scaled the test up to run more pods (which dies pretty spectacularly without this patch and goes through fine with it). There's other things that don't scale great like the pinned user namespace not getting cleaned up until kubelet removes the sandbox, but that's a different subject and is nowhere near as impactful as this is. i.e. these staying around is a bit troublesome if you churn thru pods super quickly, but like I said that's an entirely different subject to look at and not nearly as impactful: |
Just to be clear, the above test should have validated general system "ok-ness" :D Also, the main reason I am quoting this -- the "extra mount per rootfs" is sort of misleading to anyone looking just at this PR. Prior, we did a temporary mount per layer. Now its just one temporary mount per overlayfs, which can be a massive improvement if your images have a good number of layers. In both cases those temporary mounts are cleaned up shortly after the overlayfs is mounted. What @rata is referring to when claiming there's an extra mount is the approach mentioned over here that we were originally exploring to improve this situation, which avoids mounting the idmap'ed lowerdirs at all and instead supplies them to the kernel as an fd. That requires a fairly new kernel, etc, and still would benefit from the work done here to reduce the number of operations from a function of # layers to a constant per overlayfs. If anyone has recs for improving the commit message to better captures this + add some data, I'm all ears, please nitpick away! |
14a44d0 to
cc5d39e
Compare
|
Perfect, so with this patch the systemd perf issue is gone and no kubelet timeouts from #12048! Great! :) |
cc5d39e to
e6dc743
Compare
|
Hmm, CI is still not happy. I don't think that's my fault, but lemme dig a little more and if not see if i can rekick those. EDIT: Couldn't find a way to rekick, just going to rebase and rerun all of CI. |
e6dc743 to
18591b4
Compare
rata
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the PR! Great improvement, the code is clean and simple and the tests are meaningful :)
Besides CI being green, I have validated locally that all seems good, using images with one layer (debian) and images with several layers (python).
This also closes #12048, I'd like to backport it to 2.0 and 2.1
@AkihiroSuda @fuweid can you please take a look?
|
@AkihiroSuda friendly ping? |
|
@fuweid friendly ping? It would be great to merge this so we can backport to 2.0 and 2.1 too |
|
It looks good actually. Just want to test it in my local before voting. Thanks |
|
Neat idea. |
|
@fuweid -- sorry to pester, but any ideas on when you'll get to try this out? |
|
Sorry @halaney for late reply. |
fuweid
left a comment
There was a problem hiding this comment.
LGTM
Let's bake it in main branch first and then apply it into supported releases.
Thanks!
rata
left a comment
There was a problem hiding this comment.
Tihs LGTM, left a minor neat.
6a231b9 to
ddb7228
Compare
Per layer idmap'ed bind mounts are costly to performance, as shown in [0]. Each one requires taking various kernel locks, and each one shows up in the host's mount table leading to some components like systemd processing all these temporary mounts unnecessarily. Let's instead go ahead and idmap the common directory of all the layers to achieve the same effect. Now instead of being a function of the number of layers, its a constant idmap per overlayfs! This can have a big impact. For example, imagine running 100 containers at once, each with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to just 100. In reality both the shim and containerd proper do this, so its actually double that! [0]: containerd#12048 (comment) Signed-off-by: Andrew Halaney <[email protected]>
ddb7228 to
6e9b6ea
Compare
|
@AkihiroSuda @dmcgowan -- when you get some time can you review this? Would love to get this and #12114 merged |
|
Can we backport this to 2.0 and 2.1? |
|
The bot can backport PRs. But a manual PR SGTM too :) |
Too many (2-3, 2.0 only needs one commit from one of the PRs) separate but related PRs to try and untangle. I'm a simple man, git-foo is easier than bot foo! |


Per layer idmap'ed bind mounts are costly to performance, as shown in
0. Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, its a constant idmap per overlayfs!