
Conversation

@dmcgowan (Member) commented May 7, 2018

A few races exist today when deleting an image during a pull; they are addressed by this PR.

1. `ctr image pull` does not hold the same lease between fetch and unpack.
   - Solved by getting a lease in the CLI handler and passing it to both functions.
2. Rootfs apply holds no references while checking layer existence (it stats from the bottom up, and layers can get garbage collected during the check).
   - Solved by reversing the check order to top down, using prepare to check for the existence of the parent rather than stat.
3. The image object can get removed while unpack is taking place.
   - Solved by moving image creation to after unpack. Originally the image was created before unpack because, prior to leases, nothing else held a reference to the content.

This did expose a common problematic pattern in the client of stat'ing an object and then using it. There is no guarantee after a stat returns that the object will still exist, even if the stat reported that it did. For mutation and creation operations we normally add the object to the lease to prevent cleanup, but nothing is done for read operations today.
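The lease pattern behind fix 1) can be illustrated with a toy model (hypothetical types and method names, not the real containerd API): content fetched under a lease survives a garbage-collection pass that runs between fetch and unpack, which is exactly the window the race occupies.

```go
package main

import "fmt"

// store is a toy content store: content tracks which digests exist,
// leased tracks which digests are protected from garbage collection.
type store struct {
	content map[string]bool
	leased  map[string]bool
}

// fetch pulls content and immediately covers it with a lease.
func (s *store) fetch(digest string) {
	s.content[digest] = true
	s.leased[digest] = true
}

// gc removes everything that is not covered by a lease.
func (s *store) gc() {
	for d := range s.content {
		if !s.leased[d] {
			delete(s.content, d)
		}
	}
}

// unpack fails if the content was collected before it ran.
func (s *store) unpack(digest string) error {
	if !s.content[digest] {
		return fmt.Errorf("content %s: not found", digest)
	}
	return nil
}

func main() {
	s := &store{content: map[string]bool{}, leased: map[string]bool{}}
	s.fetch("sha256:layer1")
	s.gc() // GC may run between fetch and unpack
	if err := s.unpack("sha256:layer1"); err != nil {
		fmt.Println("unpack failed:", err)
	} else {
		fmt.Println("unpack succeeded: lease kept the content alive")
	}
}
```

Without the `s.leased[digest] = true` line in `fetch`, the `gc()` call between fetch and unpack deletes the content and unpack fails — the same window this PR closes by taking one lease in the CLI handler and passing it to both steps.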

Signed-off-by: Derek McGowan <[email protected]>
@codecov-io commented May 7, 2018

Codecov Report

Merging #2331 into master will decrease coverage by 4.29%.
The diff coverage is n/a.


@@            Coverage Diff            @@
##           master    #2331     +/-   ##
=========================================
- Coverage   49.28%   44.98%   -4.3%     
=========================================
  Files          84       92      +8     
  Lines        7424     9340   +1916     
=========================================
+ Hits         3659     4202    +543     
- Misses       3089     4459   +1370     
- Partials      676      679      +3
| Flag | Coverage Δ |
|------|------------|
| #linux | 49.28% <ø> (ø) ⬆️ |
| #windows | 41.24% <ø> (?) |

| Impacted Files | Coverage Δ |
|----------------|------------|
| oci/spec.go | 40% <0%> (-10%) ⬇️ |
| snapshots/native/native.go | 43.89% <0%> (-10%) ⬇️ |
| metadata/snapshot.go | 45.8% <0%> (-8.96%) ⬇️ |
| archive/compression/compression.go | 43.93% <0%> (-8.9%) ⬇️ |
| remotes/docker/fetcher.go | 41.42% <0%> (-7.6%) ⬇️ |
| content/local/writer.go | 52.63% <0%> (-7.37%) ⬇️ |
| archive/tar.go | 43.05% <0%> (-6.95%) ⬇️ |
| metadata/containers.go | 47.97% <0%> (-6.62%) ⬇️ |
| metadata/buckets.go | 54.66% <0%> (-5.04%) ⬇️ |
| metadata/images.go | 58.46% <0%> (-4.7%) ⬇️ |

... and 53 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6eee2a0...f0b3d5a.

rootfs/apply.go Outdated
if tries < 3 && errdefs.IsNotFound(err) {
if err := applyLayers(ctx, layers, chain, sn, a); err != nil {
if errdefs.IsUnavailable(err) {
time.Sleep(time.Second * time.Duration(tries))
Member

Do we have to wait at a min 1 sec? Seems a little long but not sure.

Member Author

I can simplify this and get rid of the unavailability checks by moving the key generation into the loop. Likely this scenario cannot be reconciled and is reliant on cleanup anyway.
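The simplification described here — generating the snapshot key inside the retry loop and dropping the unavailability check — can be sketched as follows. This is a hedged, self-contained toy (the `prepare` stub and key format are illustrative, not the real snapshotter API): each attempt gets a fresh key, and only NotFound errors are retried.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for errdefs.ErrNotFound (simplified stub).
var errNotFound = errors.New("not found")

// prepare is a stand-in for the snapshotter's Prepare call; here it
// fails with NotFound until the third attempt, simulating a parent
// that was garbage collected and has to be re-applied.
func prepare(key, parent string, attempt int) error {
	if attempt < 2 {
		return errNotFound
	}
	return nil
}

// applyWithRetry generates a fresh key inside the loop so every
// attempt is a new transaction, and retries at most 3 times, only
// when the parent chain has disappeared (NotFound).
func applyWithRetry(parent string) (string, error) {
	var err error
	for tries := 0; tries < 3; tries++ {
		key := fmt.Sprintf("extract-%d %s", tries, parent) // fresh key per attempt
		if err = prepare(key, parent, tries); err == nil {
			return key, nil
		}
		if !errors.Is(err, errNotFound) {
			return "", err // non-retryable error: give up immediately
		}
		// parent missing: loop and retry with a new key (the real code
		// re-applies the parent chain before retrying)
	}
	return "", err
}

func main() {
	key, err := applyWithRetry("sha256:parent")
	fmt.Println(key, err)
}
```

Abandoned keys from failed attempts are simply left for cleanup, which matches the comment above that the failure scenario is reliant on cleanup anyway.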

rootfs/apply.go Outdated
mounts, err = sn.Prepare(ctx, key, parent.String(), opts...)
if err != nil {
tries++
if tries < 3 && len(layers) > 1 && errdefs.IsNotFound(err) {
Member

this is kinda complex, do we need the len(layers) part in this?

Member Author

That len(layers) check is to prevent recursion when called from rootfs.ApplyLayer. I tried to unify the functions; let me see if I can do this in a less complex way, else I will document it.
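The top-down recursion this PR introduces can be sketched with a toy model (the `prepare`/`applyLayers` names and the map-based snapshotter state are illustrative, not the real containerd signatures): try to prepare the top layer; if its parent chain is missing (NotFound), recursively apply the parent layers first, then retry.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// snapshots records which chain IDs exist (toy snapshotter state);
// the empty parent always "exists".
var snapshots = map[string]bool{"": true}

// prepare fails with NotFound if the parent snapshot is missing.
func prepare(parent string) error {
	if !snapshots[parent] {
		return errNotFound
	}
	return nil
}

// applyLayers works from the top layer down: if preparing the top
// layer fails because its parent is missing, recursively apply the
// parent chain first, then retry the prepare.
func applyLayers(chain []string) error {
	if len(chain) == 0 {
		return nil
	}
	parent := ""
	if len(chain) > 1 {
		parent = chain[len(chain)-2]
	}
	top := chain[len(chain)-1]
	if err := prepare(parent); err != nil {
		if !errors.Is(err, errNotFound) {
			return err
		}
		// parent missing: apply everything below the top layer first
		if err := applyLayers(chain[:len(chain)-1]); err != nil {
			return err
		}
		if err := prepare(parent); err != nil {
			return err
		}
	}
	snapshots[top] = true // commit the applied layer
	return nil
}

func main() {
	err := applyLayers([]string{"layer1", "layer2", "layer3"})
	fmt.Println(err, snapshots["layer3"])
}
```

Checking existence via prepare from the top down (rather than stat from the bottom up) means a successful prepare itself holds the reference, closing the window in which a checked layer could be garbage collected.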

client.go Outdated
}

if pullCtx.Unpack {
if err := img.Unpack(ctx, pullCtx.Snapshotter); err != nil {
errors.Wrapf(err, "failed to unpack image on snapshotter %s", pullCtx.Snapshotter)
Member

return nil, errors.Wrapf(err, "failed to unpack image on snapshotter %s", pullCtx.Snapshotter)?

Member Author

Yeah, that isn't right #2011. I will make a separate PR for that one.
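The bug flagged here is that the wrapped error is constructed but never returned, so the failure is silently discarded. A minimal sketch of the corrected pattern, using the standard library's `fmt.Errorf` with `%w` (the original code uses `github.com/pkg/errors.Wrapf`, which behaves analogously; the `unpack` stub is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// unpack simulates a failing Unpack call.
func unpack() error { return errors.New("apply failed") }

// pull returns the wrapped error instead of discarding it, which is
// the fix the review comment asks for.
func pull(snapshotter string) error {
	if err := unpack(); err != nil {
		return fmt.Errorf("failed to unpack image on snapshotter %s: %w", snapshotter, err)
	}
	return nil
}

func main() {
	fmt.Println(pull("overlayfs"))
}
```

Dropped error values like this are exactly what linters such as errcheck are designed to catch, since `errors.Wrapf(err, ...)` as a bare statement compiles cleanly.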

@Random-Liu (Member)
Should we cherry-pick this?

@Random-Liu (Member) commented May 7, 2018

Offline discussed with @dmcgowan.

Problems 1) and 3) are not an issue for cri today, because the internal image cache inside cri makes sure that an image can only be deleted after it is fully pulled.

We'll update containerd and use WithPullUnpack (containerd/cri#762) in cri, so that we can safely remove the internal image cache in the future (containerd/cri#760). A cherry-pick is not needed.

dmcgowan added 3 commits May 7, 2018 15:50
Update apply layers to recursive from the top layer.
Update apply layer to check for exists and apply single layer.

Signed-off-by: Derek McGowan <[email protected]>
@dmcgowan force-pushed the fix-image-remove-race branch from 8b711ae to f0b3d5a on May 7, 2018 22:51
@crosbymichael (Member)
Is this ready to review again?

@crosbymichael (Member)
LGTM

@estesp (Member) left a comment

LGTM

@estesp estesp merged commit e63768e into containerd:master May 16, 2018
@dmcgowan dmcgowan deleted the fix-image-remove-race branch June 5, 2018 18:19