-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[release/2.1] Fix overlayfs issues related to user namespace #12222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release/2.1] Fix overlayfs issues related to user namespace #12222
Conversation
Per layer idmap'ed bind mounts are costly to performance, as shown in [0]. Each one requires taking various kernel locks, and each one shows up in the host's mount table leading to some components like systemd processing all these temporary mounts unnecessarily. Let's instead go ahead and idmap the common directory of all the layers to achieve the same effect. Now instead of being a function of the number of layers, its a constant idmap per overlayfs! This can have a big impact. For example, imagine running 100 containers at once, each with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to just 100. In reality both the shim and containerd proper do this, so its actually double that! [0]: containerd#12048 (comment) Signed-off-by: Andrew Halaney <[email protected]> (cherry picked from commit 6e9b6ea) Signed-off-by: Andrew Halaney <[email protected]>
doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function mandates for the callback to be called, even when errors
are returned, we end-up calling the nil function that of course causes:
panic: runtime error: invalid memory address or nil pointer dereference
The containerd daemon continues to run fine, though.
With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly) or return a "stub function" that doesn't do
anything on those cases (so the call works). None of these options seem
very nice.
Let's just switch the logic to a pattern more common on go: only need to
cleanup if it returned fine. This makes it easier for callers to not do
the wrong thing.
For this change we need to do the cleanup when returning errors on the
function itself, so the user doesn't need to do it for us.
Signed-off-by: Rodrigo Campos <[email protected]>
(cherry picked from commit 7a19c94)
Signed-off-by: Andrew Halaney <[email protected]>
Before this patch, the cleanup was unconditionally trying an unmount, which is not needed if the mount never succeeded in the first place. This causes logs of failed stuff, that should never be there. Also, one function was creating the tmpDir and the nested one removing it (in the cleanup function), which complicates the reasoning about not leaking resources. This patch on one hand moves the creation of the tmpDir into the function that will remove it later and renames the parameter. On the other hand, it also separates the cleanup function in two functions: cleanDir() and cleanMount(). Each one will clean the directory or the mount, only if needed, and a new cleanup function that just calls cleanDir() and cleanMount() is returned as the cleanup function handler. Because we now create the tmp directory inside the function, we need to adjust the test: the directory passed as param now will exist, but it shoild be empty. Therefore, we change the os.Stat() to an os.Remove(). If it's not empty (the cleanup didn't work), the remove will fail. Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit dd7fe0b) Signed-off-by: Andrew Halaney <[email protected]>
Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit 27ba690) Signed-off-by: Andrew Halaney <[email protected]>
When a lot of pods are created all at the same time, the umount fails with EBUSY sometimes. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to get the culprits on time (the mount is busy for a few ms), @fuwei has found what is causing the mount to be busy: containerd#12139 (comment) When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail as the forked process still has an open fd to it. Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, that retries the unmount. This retry fixes it 100% of the times on my system and in the systems of @halaney that reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched to non-detached again in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible for the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. [1]: To do it, I just run as root "unshare -m", that creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <[email protected]> (cherry picked from commit da3dc1e) Signed-off-by: Andrew Halaney <[email protected]>
|
Hi @halaney. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
rata
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Backport various overlayfs related user namespace fixes and improvements
This backports various fixes related to the following issues:
These patches make:
These are great improvements to the stability of containerd when using
user namespaces + overlayfs.