Skip to content

cmd/ctr: protect contents from GC#5692

Closed
kzys wants to merge 1 commit intocontainerd:mainfrom
kzys:import-lease
Closed

cmd/ctr: protect contents from GC#5692
kzys wants to merge 1 commit intocontainerd:mainfrom
kzys:import-lease

Conversation

@kzys
Copy link
Copy Markdown
Member

@kzys kzys commented Jul 6, 2021

"ctr image import" didn't have a lease. Due to that, GC would remove
contents during the import process.

Fixes #5690.

Signed-off-by: Kazuyoshi Kato [email protected]

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Jul 6, 2021

Build succeeded.

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Jul 6, 2021

Build succeeded.

Comment thread cmd/ctr/commands/images/import.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply defer done(ctx) is enough here and what is done in most places. If lease cleanup fails due to connection it will still get cleaned up later in the GC, if it fails in the backend the errors there will be more useful. Its not expected that a CLI user must act on a lease deletion failure though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Will update.

"ctr image import" didn't have a lease. Due to that, GC would
remove contents during the import process.

Fixes containerd#5690.

Signed-off-by: Kazuyoshi Kato <[email protected]>
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Jul 6, 2021

Build succeeded.

@k8s-ci-robot
Copy link
Copy Markdown

@kzys: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-containerd-node-e2e 85309b0 link /test pull-containerd-node-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

}
defer done(ctx)

imgs, err := client.Import(ctx, r, opts...)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to already have a lease

ctx, done, err := c.WithLease(ctx)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The individual calls have it, I am curious if there is something getting garbage collected between import and unpack? Although I would expect if something was getting GC'ed before unpack, it would still end up getting GC'ed after

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me double-check the imported image. If it doesn't have any labels, it could be GC'd before Unpack.

Copy link
Copy Markdown
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to already have a lease

@estesp
Copy link
Copy Markdown
Member

estesp commented Jul 9, 2021

confirmation that there seems to be some issue with content and GC on ctr import from a message in #containerd slack as of yesterday (8 July): https://cloud-native.slack.com/archives/C4RJZ9Z6Y/p1625752082294200

@claudiubelu
Copy link
Copy Markdown
Contributor

This PR doesn't seem to fix the issue on Windows at least.

@claudiubelu
Copy link
Copy Markdown
Contributor

So, I did some more investigation on this for a bit.

There is a test (TestImageLoad) that should have been able to hit this issue by now, but that test is using the docker.io/library/busybox:latest image, which has only one layer:

docker inspect busybox:latest

...
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:0fd05bf2930d72edc1aa0b9fa7e442295c2384512a32c0b1502392f5384acd81"
            ]
        },
...

And that layer gets "registered" with the content store, it is reachable with images.Walk, it won't be deleted anymore, even without the lease in cmd/ctr/commands/images/import.go.

But people have reported issues with images that have multiple layers [1][2][3]. In cmd/ctr/commands/images/import.go, there is still some processing going on, which require some of the bits that have been imported, but it's racing against the Garbage Collector because the lease was created and expired in containerd/import.go. This can results in errors like this:

PS C:\containerd> ctr.exe --address //./pipe//run/containerd-test/containerd --namespace k8s.io images import .\image.tar
unpacking docker.io/claudiubelu/busybox:1.29 (sha256:435b5572d74b64c9646803376baad711af044e2130e3567bed07b809af4ca8bf)...ctr: content digest sha256:7c3822e9f6e642ac53d1da8e3aa2f08b556d82ee8827dda1c93868cde18799d9: not found

And on the containerd logs we can see:

level=debug msg="removed content" digest="sha256:77b0672dda4a9a40f4a00b73dc21b446e2bc1b2ffb3e6f37525fc5967a971d46"
level=warning msg="content garbage collection failed" error="remove C:\\Program Files\\Git\\var\\lib\\containerd-test\\io.containerd.content.v1.content\\blobs\\sha256\\7c3822e9f6e642ac53d1da8e3aa2f08b556d82ee8827dda1c93868cde18799d9: The process cannot access the file because it is being used by another process."
level=debug msg="garbage collected" d=126.4602ms
level=debug msg="diff applied" d=501.3442ms digest="sha256:7c3822e9f6e642ac53d1da8e3aa2f08b556d82ee8827dda1c93868cde18799d9" media=application/vnd.docker.image.rootfs.diff.tar size=1376768

Extending the lease to cmd/ctr/commands/images/import.go, which is what this PR is doing helps, the error above no longer occurs, the Garbage Collector starts cleaning up after the operation ends. However, I mentioned in my previous comment that it doesn't entirely solve the problem. The operation is succesful, and the image shows up in the ctr images list, but it is unusable, you'll get errors like this when you try to use that image:

E0820 09:12:08.595787    4512 remote_runtime.go:241] CreateContainer in sandbox "b9446f40a8d090bfb212a5dbac2be0140288a4b0d298e880a06787a400cdc901" from runtime service failed: rpc error: code = NotFound desc = failed to create con
tainerd container: error unpacking image: failed to resolve rootfs: content digest sha256:151d5ba7519d51c4efa2068225a37ed40cd0072c9b835d15b64173d4f0e70735: not found

Looking through the containerd logs, it seems that the Garbage Collector deleted the sha256:151d5ba7519d51c4efa2068225a37ed40cd0072c9b835d15b64173d4f0e70735 content, which was needed by the image.

I think that the Garbage Collector doesn't see those layers as visible (contentSeen) when importing images, which is why they're getting deleted in containerd/metadata/content.go

[1] #5599 (comment)
[2] #5690
[3] https://cloud-native.slack.com/archives/C4RJZ9Z6Y/p1625752082294200

@cpuguy83
Copy link
Copy Markdown
Member

What I found is with regard to multiplatform images is the loader loads all blobs but only creates an image record for the default platform (stored on the client) unless --all-platforms is specified.
Depending on when GC happens the rest of those blobs (when --all-platforms is not specified) will be removed.

@claudiubelu
Copy link
Copy Markdown
Contributor

So, I have found the problem for this issue, at least for the Windows side.

So, someone commented that this didn't use to be a problem for containerd 1.4, but it started with containerd 1.5. Going through the containerd tags, it seems that this issue appeared between v1.5.0-rc.0 and v1.5.0-rc.1, more specifically, after this PR: #5298

That PR basically made the Windows' platforms.Default stricter, requiring the image to have the same OS Version prefix as the host.

Tar images imported through ctr import do not have any OSVersion field [1]... which means that the manifest platform information [2] won't have any OSVersion information either, which means that the containerd/import.Import's images.FilterPlatforms to filter out the image entirely, which means that the images.SetChildrenLabels doesn't get to label
any children [3], which in turn will cause the Garbage Collector to remove content related to the image.

This information is consistent with the information provided by @cpuguy83 . If ctr import --all-platforms is used, then a more lenient platforms.All matcher is used instead of the regular platforms.Default [4].

I'm not sure how to address this issue. Ideally, the OSVersion field would be included in the image spec [1], and it would be exported by docker save as well, but that could take time, and it would imply that users would have to make sure they have the latest docker. Should we assume that the imported image has the same OS Version as the host [5]? Looking at the images.FilterPlatforms, it seems that it's ok if the d.Platform is nil [6], so this wouldn't be a bad assumption, plus, it would match the behaviour as containerd 1.4. In any case, if anyone needs to import those images now, they can use ctr import --all-platforms.

I wish this issue could have been caught early on Import, and not afterwards, after the operation "succeeded", when trying to use the image to spawn a container. I think it would be a good idea to have something like images.WalkNotEmpty or something, which basically would ensure that all its handlers walked over at least one image (and return an error if it didn't). In our case, if we had it in containerd/import.Import, we would have been able to figure out that we're trying to import an image not meant for the host's platform, and we wouldn't have to erroneously succeed.

[1]


[2]
desc.Platform = &platforms[0]

[3]
handler = images.SetChildrenLabels(cs, handler)

[4]
var platformMatcher = platforms.All

[5] #5916
[6]
if d.Platform == nil || m.Match(*d.Platform) {

@dmcgowan
Copy link
Copy Markdown
Member

That PR basically made the Windows' platforms.Default stricter, requiring the image to have the same OS Version prefix as the host.

Tar images imported through ctr import do not have any OSVersion field...

Does it need to be this strict when OSVersion is missing?

@kzys
Copy link
Copy Markdown
Member Author

kzys commented Sep 17, 2021

This PR wouldn't fix the issue as @claudiubelu said.

@kzys kzys closed this Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ctr.exe images import can fail on Windows

8 participants