Fix lease management during pull and export#48910
Conversation
| // Prune leases | ||
| leaseManager := i.client.LeasesService() | ||
| pullLeases, err := leaseManager.List(ctx, `labels."moby/action.pull"`) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| for i, lease := range pullLeases { | ||
| opts := []leases.DeleteOpt{} | ||
| if i == len(pullLeases)-1 { | ||
| opts = append(opts, leases.SynchronousDelete) | ||
| } | ||
| if err := leaseManager.Delete(ctx, lease, opts...); err != nil { | ||
| return nil, err | ||
| } | ||
| } |
There was a problem hiding this comment.
For a follow up - not sure how feasible that is, but it would be nice if we'd be able to determine how much disk space was freed.
thaJeztah
left a comment
There was a problem hiding this comment.
overall LGTM
left some comments; nothing really blocking (I think), but possibly some for follow-ups. The timeout is the one I'm wondering if we should shorten as part of this PR (but open to input).
daemon/containerd/image_exporter.go
Outdated
| } | ||
| defer func() { | ||
| if err := leasesManager.Delete(ctx, lease); err != nil { | ||
| if err := done(context.WithoutCancel(ctx)); err != nil { |
There was a problem hiding this comment.
Silly question (would have to look where all it's used); are there done() functions returned by such functions that should always be called with WithoutCancel()?
Considering if there's cases where the done function returned should handle the WithoutCancel parts so that consumers don't have to be dealing with those.
There was a problem hiding this comment.
In the export case there really isn't a need to hold onto anything since even if the operation is canceled, it will need to restart from the beginning on retry. For the pull case, a cancel may want to keep progress so the context error could be respected. We could make this more explicit and check for a context error rather than let the function error from it.
daemon/containerd/image_pull.go
Outdated
| // How long downloaded content will be held after cancellation | ||
| expireAt := time.Now().Add(8 * time.Hour) |
There was a problem hiding this comment.
Do we want a const for this?
Also considering if this should be shorter even; I somewhat expect the "resume previous pull" to be happening in a much shorter timeframe (1, 2 hours?);
- starting a pull, realising you're either pulling the wrong image, or (e.g. in CI); test failed, and cancelled the pull
- next attempt to pull will be either immediate after, or after fixing what caused the pull to be cancelled
Are these partial pulls only completed layers (pull for a layer completed, but not for the full image), or can these also be "partial downloads"?
There was a problem hiding this comment.
It should for partial downloads of the blobs. Any partially unpacked snapshotters should be discarded as I believe we always use a unique identifier for unpack.
I'm adding a const. We can make this configurable as well in the future if we want and make a smaller value. With it static, it really has to cover the upper bound of possible pull times.
daemon/containerd/image_pull.go
Outdated
| ctx, done, err := i.client.WithLease(ctx, | ||
| leases.WithRandomID(), | ||
| leases.WithLabels(map[string]string{ | ||
| "moby/action.pull": baseRef.String(), |
There was a problem hiding this comment.
Should we have a const for this one? (currently only used in two places, but not sure if we (should) / will be using the same label in other locations? Do we know if BuildKit also sets such labels, and if these are things we should align?
There was a problem hiding this comment.
Maybe later we can, the other spot we use the label is in a filter string which requires extra formatting. I prefer just to see the value and we can either add a utility later or ensure they remain consistent with a test.
daemon/containerd/image_pull.go
Outdated
| leases.WithRandomID(), | ||
| leases.WithLabels(map[string]string{ | ||
| "moby/action.pull": baseRef.String(), | ||
| "containerd.io/gc.expire": expireAt.Format(time.RFC3339), |
There was a problem hiding this comment.
Hm... I was surprised that these were not exported as either a const or as a utility in containerd; should they be?
moby/vendor/github.com/containerd/containerd/metadata/gc.go
Lines 60 to 65 in 803534f
I see containerd v2 added some extra documentation to some of these to outline the required format for values set; https://github.com/containerd/containerd/blob/f0ebbd3c8a662656b458e7e1cca059a690d557a7/core/metadata/gc.go#L69-L76
// labelGCExpire indicates that an object is collectible after the
// provided time. For image objects, this makes them available to
// garbage collect when expired, when not provided, image objects
// are root objects that never expire. For non-root objects such
// as content or snapshots, these objects will be treated like
// root objects before their expiration.
// Expected format is RFC 3339
labelGCExpire = []byte("containerd.io/gc.expire")So maybe a utility could make most sense;
leases.WithGCExpireAt(time.Time)(or something) to make sure it's in the correct format- Some of the other ones may require a specific format as well, so perhaps could require a utility as well to validate (instead of a plain
constfor the right label)?
There was a problem hiding this comment.
There is a utility, here it is just to be specific. There is no logic around it though and the format doesn't change, seemed cleaner to just clearly see the labels that are getting set.
There was a problem hiding this comment.
Oh, I see there's indeed a WithExpiration;
moby/vendor/github.com/containerd/containerd/leases/lease.go
Lines 81 to 91 in 113d3ce
Yeah, mostly looking that using a consistent approach may help finding related things. In this case, I just checked where we're setting these, and I found these;
git grep '\.WithExpiration' -- ':!vendor'
builder/builder-next/adapters/containerimage/pull.go: ctx, done, err := leaseutil.WithLease(ctx, p.is.LeaseManager, leases.WithExpiration(5*time.Minute), leaseutil.MakeTemporary)
daemon/containerd/image_builder.go: ctx, _, err := i.client.WithLease(ctx, leases.WithRandomID(), leases.WithExpiration(1*time.Hour))
daemon/containerd/image_builder.go: leases.WithExpiration(time.Hour*24),
daemon/containerd/image_builder.go: ctx, release, err := i.client.WithLease(ctx, leases.WithRandomID(), leases.WithExpiration(1*time.Hour))
daemon/containerd/image_commit.go: ctx, release, err := i.client.WithLease(ctx, leases.WithRandomID(), leases.WithExpiration(1*time.Hour))
daemon/images/image_pull.go: leases.WithExpiration(24 * time.Hour),Interestingly it looks like for non-containerd image-pulls, we still use 24 hours, but a different label;
moby/daemon/images/image_pull.go
Lines 118 to 125 in 113d3ce
And for classic builder with containerd as well, and again with a different label;
moby/daemon/containerd/image_builder.go
Lines 235 to 241 in 113d3ce
Perhaps those should be corrected as well?
There was a problem hiding this comment.
I'm not going to touch the ones outside of daemon/containerd for now, those might be worth looking but not related to the issues we are seeing with the containerd integration. I added a withLease method to our own image service that will add the labels and logic that is consistent with pruning.
c12a063 to
431d096
Compare
431d096 to
d29fe20
Compare
|
Looks unrelated, can someone kick the re-run? |
daemon/containerd/leases.go
Outdated
|
|
||
| ctx = leases.WithLease(ctx, l.ID) | ||
| return ctx, func() { | ||
| if ctx.Err() != nil { |
There was a problem hiding this comment.
This seems racey. Seems like we should always use context.WithoutCancel regardless of if the context is cancelled.
| pruneLeaseFilter = `labels."moby/prune.images"` | ||
| ) | ||
|
|
||
| func (i *ImageService) withLease(ctx context.Context, cancellable bool) (context.Context, func(), error) { |
There was a problem hiding this comment.
Can you help me understand the reason one would set cancellable to true?
d29fe20 to
c57c9c6
Compare
Looks like this test became flaky recently 🤔 |
|
@vvoland yeah, I think it's one of the tests that became flaky related to the runc 1.2 updates, and there were some remaining patches that didn't get merged (not sure if all of them were good though); #48160 (comment) |
|
| var outNewImg containerd.Image | ||
|
|
||
| if oldImage.Target.Digest != "" { | ||
| // Lease the old image content to prevent it from being garbage collected until we keep it as dangling image. |
There was a problem hiding this comment.
Is this comment still useful to preserve?
There was a problem hiding this comment.
I could add a comment to the withLease function, but that is the whole point of the lease, it seems redundant to describe what a lease is everywhere it is used.
There was a problem hiding this comment.
Yeah, this one felt like it added additional context to why we're holding on to the old image (which may not be immediately obvious).
| // Necessary to prevent the contents from being GC'd | ||
| // between writing them here and creating an image |
| // TODO: Consider a better way to do this. It is better to have a container directly | ||
| // reference a snapshot, however, that is not done today because a container may | ||
| // removed and recreated with nothing holding the snapshot in between. Consider | ||
| // removing this lease and only temporarily holding a lease on re-create, using | ||
| // non-expiring leases introduces the possibility of leaking resources. | ||
| ls := i.client.LeasesService() |
There was a problem hiding this comment.
Looks like the last commit now only adds this comment, but the commit message says;
Add label on snapshotter to warn about non-expiring leases
Ensure that leases have a reasonable expiration and are cleaned up during prune Signed-off-by: Derek McGowan <[email protected]>
Signed-off-by: Derek McGowan <[email protected]>
Currently when preparing a snapshot for a container, a lease is used to hold that snapshot for the lifespan of a container. That is workaround to preserve the snapshot when a container is recreated, however, the containerd object should be able to hold this reference itself. Signed-off-by: Derek McGowan <[email protected]>
c57c9c6 to
4becdac
Compare
Ensure that leases have a reasonable expiration and are cleaned up during prune
Currently it seems some leases could be leaked as they were created with no expiration and their removal could fail due to using a canceled context.
- Description for the changelog