Correct clean-up of Windows Layers in testsuite#5133
Correct clean-up of Windows Layers in testsuite#5133estesp merged 2 commits intocontainerd:masterfrom
Conversation
|
Hi @TBBle. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Build succeeded.
|
10e1f78 to
4d2d65c
Compare
|
Build succeeded.
|
4d2d65c to
f887038
Compare
|
Build succeeded.
|
f887038 to
0255ffe
Compare
|
Build succeeded.
|
0255ffe to
a6c2aca
Compare
|
Build succeeded.
|
a6c2aca to
e4abc14
Compare
|
Build succeeded.
|
e4abc14 to
8d7e735
Compare
|
Build succeeded.
|
8d7e735 to
554e6be
Compare
|
Build succeeded.
|
cpuguy83
left a comment
There was a problem hiding this comment.
I'm not familiar with the windows snapshotter layout and couldn't really determine it just looking at the code, so assuming its a base dir with directories with integer names.
Other than that the Unprepare/Deactivate calls look similar to some cleanup code in the moby windows graphdriver, so 👍
There was a problem hiding this comment.
Returning an error doesn't seem right here because this stops cleanup.
There was a problem hiding this comment.
I... kind-of want to stop cleanup and fail, because this means (unless Atoi has other failure cases) that the Windows snapshotter behaviour has changed, and this code wasn't updated. I'd want that to fail CI.
There was a problem hiding this comment.
To be clear, the only circumstances I could hit this with the current code-base is an rm-<digits> directory, created during func (s *snapshotter) Remove, and having fallen through either "Failed to rename after failed commit" or "Failed to remove root filesystem". In that case, it seems quite likely we will fail to call cleanupWCOWLayer on it anyway.
There was a problem hiding this comment.
I could refactor this to collect all the directories, best-effort sort them (I'd have to strip the rm- prefix so that I remove child layers before parent layers), and then record any cleanupWCOWLayer failure but still try it on all of them. The outcome would be that if a child layer is stuck, it and all its parent layers would report errors during clean-up, and we'd probably stumble over panic that prompted this PR in the first place, as we'd be trying to remove a parent layer while its child is activated.
There was a problem hiding this comment.
Huh, and rebasing to master, I managed to trigger exactly this code-path on CI:
2021-03-20T00:25:28.7332957Z failed to remove test root dir failed to cleanup WCOW layers in C:\Program Files\containerd\root-test\io.containerd.snapshotter.v1.windows\snapshots: strconv.Atoi: parsing "rm-26": invalid syntax
2021-03-20T00:25:28.8327339Z exit status 1
2021-03-20T00:25:28.8328816Z FAIL github.com/containerd/containerd/integration/client 358.389s
2021-03-20T00:25:28.9091215Z mingw32-make: *** [Makefile:181: integration] Error 1
The layer itself doesn't seem to have caused any failures in the tests, so my guess this hit the "Failed to remove root filesystem" Warnf and that doesn't turn into a test failure (or get logged anywhere...). Annoyingly, there is a hcsshim::DeactivateLayer span, but not a hcsshim::DestroyLayer span, so I can't definitively tie this to a particular test.
I suspect this represents a gap in the containerd Windows snapshotter, as it's never cleaning up left-over rm- layer directories... Particularly because if I'm reading the test logs correctly, the directory in question came from "Integration 1", but the cleanup failure was the end of "Integration 2".
There was a problem hiding this comment.
Thinking about it, it might be reasonable to, during this walk, just os.RemoveAll or cleanupWCOWLayer any rm-* layers we encounter during the Walk. My main concern is that if a stack of layers fails removal, and ends up all being renamed to rm-*, we'll walk them in the wrong order and hit the panic again.
There was a problem hiding this comment.
Upshot for me is that if this routine fails as-is, we should have failed earlier because the tests have left the Snapshotter in an inconsistent state, and I'm not sure if trying to force the clean-up harder is productive at that point, as it seems like clean-up will fail for the same reasons that left the Snapshotter state inconsistent.
b0a4d01 to
fc1131d
Compare
|
Looks like the Linux CI pipeline is borked on master... Windows is passing, except when we somehow trigger a case which suggests (to me) a test-suite failure that went uncaught. |
kevpar
left a comment
There was a problem hiding this comment.
LGTM. Thank you for working on this. :)
|
Sorry for the timing of your last rebase--you caught the partial day where we had an issue in main with the test image for Linux; if you rebase now the Linux failures should disappear. |
e5968e2 to
aa9ffb2
Compare
|
Three clean runs on Windows (rebase to master, comment improvement, and
As noted in microsoft/hcsshim#961, an OS-level fix should change the triggering situation (mis-ordered unmounts) from a panic to a failure, but still better to not cause the situation in the first place. ^_^ |
This ensures that we do not trigger assertions inside HCS by tring to call hcsshim.DestroyLayer on the parent of a currently-activated layer. It also deactivates the layers before deletion, to ensure we trigger or avert file-in-use failures due to leftover state from the tests with more detail than 'destroy failed'. Signed-off-by: Paul "TBBle" Hampson <[email protected]>
Signed-off-by: Paul "TBBle" Hampson <[email protected]>
aa9ffb2 to
1fd3d12
Compare
A fix for
sys.ForceRemoveAllon Windows: Clean up individual layers from the windows snapshotter using wclayer APIsERROR_DEV_NOT_EXIST), Deactivate, Destroy.os.RemoveAllthe root dir like we do on non-Windows.Tested with a now-reverted hack to reproduce the issue reliably on CI. It reproduced on the fourth of eight repeats of the Windows Integration test. Note that due to 30 minute timeout, it would be cancelled during run 5. I'm not sure why the suite takes 4 minutes now, it used to take 1 minute when running in parallel, based on some of the logs in #4924 (comment).
Fixes: #4924
I have cherry-picked the fix into #4419, since that PR enables the full Snapshot test suite on Windows, and hence reproduces #4924 almost every time, including cases where the tests themselves have failed and may have left the system in a bad state, i.e. files locked open preventing clean-up. This change should remove the
panicwe see in this case, but may still fail to clean up with nice, readable errors. That said, even when #4419 fails, with this change it's cleaning up without error, so I suspect its "file still open" problems are transitory, or the clean-up routine here works correctly to unstick them.