-
Notifications
You must be signed in to change notification settings - Fork 18.9k
Don't fail daemon start on missing layer (skip layers+base images)
#25806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ping @dmcgowan @tonistiigi My suspicion is that |
|
@jhowardmsft Put differently, |
|
Hmm. Yeah, I get the point, but not sure where the fix should go. There's already an almost identical get-out-of-jail check immediately above checking the digest is valid. [Edit - assuming that we all agree that the daemon should still start in this scenario and handle the case in some way] |
image/fs.go
Outdated
| } | ||
| if err := f(ID(dgst)); err != nil { | ||
| if err == layer.ErrLayerDoesNotExist { | ||
| logrus.Debugf("Skipping missing layer %s: %s", dgst, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we get this in the right spot, this should probably be a warning.
|
@jhowardmsft Through analysis, I am unable to figure out which walk call is causing this. Could you run this same patch with a |
|
Yes, thank you! I'll do this in the morning. |
|
Here's the stack: |
69b5772 to
78d283d
Compare
|
Based on that stack - moved the check to the restore function which is hopefully more palatable 😄. Unfortunately though it can't log what layer/image failed in that context in the way errors are currently passed back. |
|
@jhowardmsft Try filtering at line 73 or 78 in store.go. :) By putting the error outside of the |
78d283d to
adeddfd
Compare
|
@stevvooe Doh. Yes, of course. Updated and verified. (It only needed the first one as it's the |
|
Ok, code looks good now. But, why are these layers missing? (should have asked that earlier) |
|
@stevvooe a great question but one I don't know the answer to. I was planning on digging in further as to what caused that once I'm back from vacation (29th). @PatrickLang please keep that machine around if you can in that state, but if not, at least zip up the programdata\docker directory to keep safe so I can do some more analysis on it. |
|
I'll keep that machine as is. I left it in the office and am out most of next week at ContainerCon/LinuxCon Sent from a tiny keyboard
|
|
@PatrickLang I am at (Container/Linux) * Con. Come find me and let's see if we can figure this out. Skipping bad/missing layers worries me. |
|
We talked and agree we need to get an error message that better points to the source of the problem. I think I know how I corrupted it so I'll give that a try. |
|
@PatrickLang Were you able to reproduce the corruption/invalid data? |
|
I agree that root causing it needs further investigation, but having the daemon not be able to start at all is obviously very undesirable. |
|
@stevvooe I'm giving it a try today. I was out the last few days. |
|
@jhowardmsft Agreed. We want to have a good error message here for what action the user should take. My worry is that running after encountering this could lead to some sort of exploit. |
|
@stevvooe I backed up this machine and managed to clone it to a VM. @jhowardmsft should have everything he needs to work on any future changes. The original steps to reproduce the problem aren't possible because I had installed an expired Windows Insider build, then upgraded from there. The old build won't boot or install any longer because the cert has expired. Can we move forward with the fix to start the daemon correctly? |
|
Debugging some more. But this code is black magic to me. Had to keep using the laptop in the end - restoring the VM didn't work. The high-level cause seems down to a different chainID being returned between 1.12.1 (which starts up fine) and 1.13 (which fails) at this line: https://github.com/docker/docker/blob/master/image/store.go#L71. With some extra debugging in the restore() and walk() functions, here's the 1.12.1 log snipped: But on 1.13-dev (master), a different chainID is returned for the walk function with ID |
|
Of course, it's different due to the base layer changes. The 1.12 ChainID() implementation is completely different on Windows to the unified 1.13 version.. https://github.com/docker/docker/blob/1.12.x/image/rootfs_windows.go#L35. @swernli - need to talk on Tuesday :) |
|
@stevvooe @swernli I might have a slightly better fix after further playing with the code this morning. As I say though, this really isn't an area of code I know particularly well. It could potentially be improved by having a _windows.go file, but in many ways I like keeping the unified file. This does also solve the daemon not starting issue on @PatrickLang's laptop. If only the function also returned a well typed error, but that would be a lot more invasive a fix... Commit: microsoft@710b456 |
|
The linked commit looks sensible to me. The old image format shouldn't prevent the daemon from starting. It might be a good idea to log the fact that an image was ignored, though. |
|
Yup, microsoft/docker@710b456 looks good to me too. |
|
microsoft/docker@710b456 looks fine. |
Signed-off-by: John Howard <[email protected]>
adeddfd to
540c8e9
Compare
|
PR updated with newer fix as per above. Verified daemon starts on Patricks laptop. |
|
LGTM |
1 similar comment
|
LGTM |
layers+base images)
Signed-off-by: John Howard [email protected]
@stevvooe @aaronlehmann @swernli Would appreciate your reviews here as this isn't an area of the code I'm overly familiar with, so I'm not sure if there's inadvertent side-effects.
@PatrickLang somehow had managed to get a machine into a state where layers were missing, or at least the daemons database was out of reality with what was on-disk. This was causing the daemon to fail to start:
This PR does a very similar fix to https://github.com/docker/docker/blob/master/image/fs.go#L80-L81 to skip an invalid digest, but for the case if a layer does not exist (
ErrLayerDoesNotExistin thelayerpackage).After this fix: