core/mount: Fix doPrepareIDMappedOverlay cleanup function #12114
mxpv merged 4 commits into containerd:main from
Conversation
Hi @rata. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected by the label. I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2e10ae6 to 937ac52
/ok-to-test
Force-pushed from 937ac52 to 4e945b9
All green except Fedora, with an unrelated failure. Pushing again to re-run the tests.
@AkihiroSuda @fuweid can you take a look?
Force-pushed from 4e945b9 to e6705a2
halaney left a comment:
lgtm overall, I like the centralization of who makes/deletes the tmpdir.
Force-pushed from e6705a2 to 2be10e1
halaney left a comment:
I'll approve after testing again, but a similar patch in my tree has worked already, so I think it's good to go as well.
Leaked mounts, be gone! :P When the umount fails due to EBUSY, the second one succeeds for me.
Force-pushed from 8a280f0 to 09694d9
Force-pushed from 09694d9 to 456ec54
Force-pushed from 56f4467 to 4da6b42
Force-pushed from ca86140 to 20826e5
Tests are green! (failures were unrelated to this PR)
doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function requires the callback to be called even when errors are
returned, we end up calling the nil function, which of course causes:
panic: runtime error: invalid memory address or nil pointer dereference
The containerd daemon continues to run fine, though.
With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly), or return a "stub function" that does nothing
in those cases (so the call works). Neither option seems very nice.
Let's just switch the logic to a pattern more common in Go: cleanup is
only needed if the function returned fine. This makes it easier for
callers not to do the wrong thing.
For this change, the function itself needs to do the cleanup when it
returns an error, so the caller doesn't need to do it for us.
Signed-off-by: Rodrigo Campos <[email protected]>
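The pattern the commit switches to can be sketched as follows. This is a minimal, hypothetical illustration (the names `prepareResource` and the temp-dir body are made up, not containerd's actual code): the cleanup function is non-nil only when err == nil, and on the error path the function undoes its own partial work before returning.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// prepareResource sketches the pattern: cleanup is only returned
// (non-nil) on success. On failure, the function cleans up its own
// partial work, so callers never call cleanup on error.
func prepareResource(fail bool) (cleanup func(), err error) {
	dir, err := os.MkdirTemp("", "resource-")
	if err != nil {
		return nil, err
	}
	if fail {
		// Error path: undo our own work; the caller gets no cleanup func.
		os.RemoveAll(dir)
		return nil, errors.New("prepare failed")
	}
	return func() { os.RemoveAll(dir) }, nil
}

func main() {
	cleanup, err := prepareResource(false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Safe: cleanup is guaranteed non-nil whenever err == nil.
	defer cleanup()
	fmt.Println("prepared")
}
```

With this shape, a caller can never dereference a nil cleanup: the only code path that yields a cleanup function is the success path.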
Force-pushed from 20826e5 to 660b655
Rebased and all tests are green. @fuweid @AkihiroSuda can you please take a look?
cleanup()
...
_, err = os.Stat(remountsLocation)
err = os.Remove(remountsLocation)
could you please add more information on why we need to change this?
The original idea is to verify that remountsLocation is deleted after cleanup is called, if there is no failpoint.
But in this change you manually remove it, so we can't figure out who removes it.
This is explained in the commit message that does that change :)
But it's because we now move the os.MkdirTemp() from prepareIDMappedOverlay() to doPrepareIDMappedOverlay(). This makes it simpler to reason about and not leak dirs. Before, the directory was created in prepareIDMappedOverlay() but removed by the cleanup of doPrepareIDMappedOverlay(), and not even always, as you couldn't always call that cleanup function. It was messy.
But now that everything is done by that function, we expect this directory to be empty in the "all is fine" case. Therefore, the os.Remove() should work, as the directory is empty. If we are in the "err returned" case, we expect the remove to return an error, as there should be directories inside it. This is what the if tc.injectUmountFault below is doing.
This is what I wrote in the commit msg, let me know if it's not clear:
This patch on one hand moves the creation of the tmpDir into the
function that will remove it later and renames the parameter. On the
other hand, it also separates the cleanup function in two functions:
cleanDir() and cleanMount(). Each one will clean the directory or the
mount, only if needed, and a new cleanup function that just calls
cleanDir() and cleanMount() is returned as the cleanup function handler.
Because we now create the tmp directory inside the function, we need to
adjust the test: the directory passed as param now will exist, but it
should be empty. Therefore, we change the os.Stat() to an os.Remove().
If it's not empty (the cleanup didn't work), the remove will fail.
Does it make sense?
{
    name:                    "cleanup callback short failure",
    callbackFailure:         true,
    callbackFailureDuration: unmountRetryDelay * 2, // Duration of 2x unmountRetryDelay means it will fail for about two retries.
},
based on my experience, the CI env is not stable, and the CI provider can be overcommitted, running multiple jobs on one node. Once that node is busy, the test is easily flaky. I don't know how to handle it; maybe we can implement a specific umountWithRetry for ID mapping (it could have more retries) so that we can verify the retry logic.
Ohh, great point. I created a channel that we only send on in tests, and only when opting in to this "visibility".
Not sure the complexity is worth it, though; I think removing the tests might be simpler.
What do you think?
To simplify, I removed those tests. Here is a diff in another branch that shows what I did: rata/containerd@rootfs-cleanup-fixes...rata:containerd:rootfs-cleanup-fixes-unmount-tests
I guess we can tackle the test in a follow-up PR once we agree on a strategy?
We can handle it in a follow-up~ Thanks!
@fuweid how do you want to handle this? The diff in the link is a WIP (more things are needed), but it's not really trivial.
I've also quickly checked internal/failpoint, but it's not clear it will be useful for this. Do you know?
I'm not convinced it's worth writing a test: this will be handled by the mount manager "soon", the wrapper already retries today, and it's quite a simple function. I don't think other retries are being tested, and when you restart containerd any leaked mount is cleaned up.
What do you think?
Before this patch, the cleanup was unconditionally trying an unmount, which is not needed if the mount never succeeded in the first place. This produced failure logs that should never be there. Also, one function was creating the tmpDir and the nested one removing it (in the cleanup function), which complicates reasoning about not leaking resources.

This patch on one hand moves the creation of the tmpDir into the function that will remove it later and renames the parameter. On the other hand, it also separates the cleanup function in two functions: cleanDir() and cleanMount(). Each one will clean the directory or the mount, only if needed, and a new cleanup function that just calls cleanDir() and cleanMount() is returned as the cleanup function handler.

Because we now create the tmp directory inside the function, we need to adjust the test: the directory passed as param now will exist, but it should be empty. Therefore, we change the os.Stat() to an os.Remove(). If it's not empty (the cleanup didn't work), the remove will fail.

Signed-off-by: Rodrigo Campos <[email protected]>
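The structure this commit message describes can be sketched roughly as below. This is a simplified, hypothetical model (the names and bodies are illustrative, not containerd's actual code): the function creates its own tmp dir, tracks which steps succeeded, and returns one cleanup handler built from cleanDir() and cleanMount(), each a no-op if its step never happened.

```go
package main

import (
	"fmt"
	"os"
)

// doPrepare models the commit's structure: the same function creates
// and (via the returned cleanup) removes the tmp dir, and the cleanup
// is split into cleanDir/cleanMount, each conditional on its step.
func doPrepare(mountFails bool) (cleanup func(), err error) {
	tmpDir, err := os.MkdirTemp("", "idmap-")
	if err != nil {
		return nil, err
	}
	mounted := false

	cleanDir := func() { os.RemoveAll(tmpDir) }
	cleanMount := func() {
		if !mounted {
			return // never mounted: nothing to unmount, no spurious error logs
		}
		// A real implementation would unmount tmpDir here.
		mounted = false
	}

	if mountFails {
		// Error path: undo our own work; caller gets no cleanup func.
		cleanMount()
		cleanDir()
		return nil, fmt.Errorf("mount failed")
	}
	mounted = true
	return func() { cleanMount(); cleanDir() }, nil
}

func main() {
	cleanup, err := doPrepare(false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cleanup()
	fmt.Println("cleaned up")
}
```

Because creation and removal now live in one function, the "who deletes the tmpdir" question the reviewers raised has a single answer, and the unmount is skipped entirely when the mount never happened.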
Force-pushed from 0275aec to b3ed66c
When a lot of pods are created all at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment)

While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), @fuweid has found what is causing the mount to be busy: containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail, as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched to non-detached again in containerd#10955 and we started to leak mounts. As @fuweid mentioned, a detached mount is quite invisible to the admin and, therefore, a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed.

[1]: To do it, I just run "unshare -m" as root, which creates a mntns with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <[email protected]>
Force-pushed from b3ed66c to da3dc1e
This PR fixes:
While we are there, I took the opportunity to add tests for the cleanup function, verifying that these issues, as well as previous issues we encountered, are fixed.
Please see individual commits for more details. I think it's simpler to review by checking each commit, too.
PR #5890 introduced the issues in the cleanup function.
Fixes: #12139