Skip to content

Conversation

@lowenna
Copy link
Member

@lowenna lowenna commented Aug 17, 2016

Signed-off-by: John Howard [email protected]

@stevvooe @aaronlehmann @swernli Would appreciate your reviews here as this isn't an area of the code I'm overly familiar with, so I'm not sure if there's inadvertent side-effects.

@PatrickLang somehow had managed to get a machine into a state where layers were missing, or at least the daemons database was out of reality with what was on-disk. This was causing the daemon to fail to start:

time="2016-08-17T09:15:17.943216500-07:00" level=debug msg="Listener created for HTTP on npipe (//./pipe/docker_engine)" 
time="2016-08-17T09:15:32.226688000-07:00" level=info msg="Windows default isolation mode: hyperv" 
time="2016-08-17T09:15:32.226688000-07:00" level=debug msg="Using default logging driver json-file" 
time="2016-08-17T09:15:32.236723300-07:00" level=debug msg="WindowsGraphDriver InitFilter at C:\\ProgramData\\docker\\windowsfilter" 
time="2016-08-17T09:15:32.265923900-07:00" level=info msg="[graphdriver] using prior storage driver: windowsfilter" 
time="2016-08-17T09:15:32.265923900-07:00" level=debug msg="Using graph driver windowsfilter" 
time="2016-08-17T09:15:32.223672800-07:00" level=debug msg="Stackdump - waiting signal at Global\\docker-daemon-16852" 
time="2016-08-17T09:15:32.354661500-07:00" level=debug msg="Max Concurrent Downloads: 3" 
time="2016-08-17T09:15:32.359664300-07:00" level=debug msg="Max Concurrent Uploads: 5" 
time="2016-08-17T09:15:51.772631500-07:00" level=fatal msg="Error starting daemon: layer does not exist" 
Program exited with code 0

This PR does a very similar fix to https://github.com/docker/docker/blob/master/image/fs.go#L80-L81 to skip an invalid digest, but for the case if a layer does not exist (ErrLayerDoesNotExist in the layer package).

After this fix:

time="2016-08-17T09:49:09.842027700-07:00" level=debug msg="Listener created for HTTP on npipe (//./pipe/docker_engine)" 
time="2016-08-17T09:49:09.843029200-07:00" level=info msg="Windows default isolation mode: hyperv" 
time="2016-08-17T09:49:09.844028300-07:00" level=debug msg="Using default logging driver json-file" 
time="2016-08-17T09:49:09.846048700-07:00" level=debug msg="WindowsGraphDriver InitFilter at C:\\ProgramData\\docker\\windowsfilter" 
time="2016-08-17T09:49:09.844028300-07:00" level=debug msg="Stackdump - waiting signal at Global\\docker-daemon-14980" 
time="2016-08-17T09:49:09.847028900-07:00" level=info msg="[graphdriver] using prior storage driver: windowsfilter" 
time="2016-08-17T09:49:09.847550400-07:00" level=debug msg="Using graph driver windowsfilter" 
time="2016-08-17T09:49:09.856037400-07:00" level=debug msg="Max Concurrent Downloads: 3" 
time="2016-08-17T09:49:09.856534400-07:00" level=debug msg="Max Concurrent Uploads: 5" 
time="2016-08-17T09:49:14.340674200-07:00" level=debug msg="Skipping missing layer sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e: layer does not exist" 
time="2016-08-17T09:49:19.206632000-07:00" level=debug msg="Skipping missing layer sha256:be1da82470363009d64cb54417a97a38dfcca448116a5ccd9cd5d4b3f78c663c: layer does not exist" 
time="2016-08-17T09:49:21.790894400-07:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" 
time="2016-08-17T09:49:21.974429200-07:00" level=debug msg="Loaded container 0af3528b2d7036217f274d15a657c05e1a4939fa3f4f1a4c96b85cde02474e12" 
time="2016-08-17T09:49:21.974937700-07:00" level=debug msg="Loaded container 2a4461f112942c0ee1956ade6e3db148729d7cc8bc9211489ddcb7d7d750640c" 
... rest of daemon startup

@aaronlehmann
Copy link

ping @dmcgowan @tonistiigi

My suspicion is that Walk isn't the right place to do this. I would expect Walk to return an error if the callback returns layer.ErrLayerDoesNotExist.

@stevvooe
Copy link
Contributor

@jhowardmsft Put differently, IDWalkFunc should not let the error escape unless it intends to terminate the Walk.

@lowenna
Copy link
Member Author

lowenna commented Aug 18, 2016

Hmm. Yeah, I get the point, but not sure where the fix should go. There's already an almost identical get-out-of-jail check immediately above checking the digest is valid.

[Edit - assuming that we all agree that the daemon should still start in this scenario and handle the case in some way]

image/fs.go Outdated
}
if err := f(ID(dgst)); err != nil {
if err == layer.ErrLayerDoesNotExist {
logrus.Debugf("Skipping missing layer %s: %s", dgst, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we get this in the right spot, this should probably be a warning.

@stevvooe
Copy link
Contributor

@jhowardmsft Through analysis, I am unable to figure out which walk call is causing this. Could you run this same patch with a debug.PrintStack() next to log message so we can identify the caller?

@lowenna
Copy link
Member Author

lowenna commented Aug 18, 2016

Yes, thank you! I'll do this in the morning.

@lowenna
Copy link
Member Author

lowenna commented Aug 18, 2016

Here's the stack:

goroutine 1 [running]:
runtime/debug.Stack(0x0, 0x0, 0x0)
    C:/Go/src/runtime/debug/stack.go:24 +0x87
runtime/debug.PrintStack()
    C:/Go/src/runtime/debug/stack.go:16 +0x1f
github.com/docker/docker/image.(*fs).Walk(0xc08239f800, 0xc08239d0a0, 0x0, 0x0)
    e:/go/src/github.com/docker/docker/image/fs.go:87 +0x551
github.com/docker/docker/image.(*store).restore(0xc0823a5840, 0x0, 0x0)
    e:/go/src/github.com/docker/docker/image/store.go:64 +0xb1
github.com/docker/docker/image.NewImageStore(0x35f06f8, 0xc08239f800, 0x35f0750, 0xc0823a4c80, 0x0, 0x0, 0x0, 0x0)
    e:/go/src/github.com/docker/docker/image/store.go:56 +0x1d4
github.com/docker/docker/daemon.NewDaemon(0xc082388a00, 0x35f0138, 0xc0823b6010, 0x35f0100, 0x241e830, 0x0, 0x0, 0x0)
    e:/go/src/github.com/docker/docker/daemon/daemon.go:539 +0x1b44
main.(*DaemonCli).start(0xc082375b90, 0x0, 0x0)
    e:/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:265 +0x1915
main.main()
    e:/go/src/github.com/docker/docker/cmd/dockerd/docker.go:68 +0x39a

@lowenna lowenna force-pushed the jjh-missingdigest branch from 69b5772 to 78d283d Compare August 18, 2016 16:45
@lowenna
Copy link
Member Author

lowenna commented Aug 18, 2016

Based on that stack - moved the check to the restore function which is hopefully more palatable 😄. Unfortunately though it can't log what layer/image failed in that context in the way errors are currently passed back.

@stevvooe
Copy link
Contributor

@jhowardmsft Try filtering at line 73 or 78 in store.go. :)

By putting the error outside of the Walk, the loading of all layers will be interrupted by this single error.

@lowenna lowenna force-pushed the jjh-missingdigest branch from 78d283d to adeddfd Compare August 19, 2016 16:50
@lowenna
Copy link
Member Author

lowenna commented Aug 19, 2016

@stevvooe Doh. Yes, of course. Updated and verified. (It only needed the first one as it's the Get that fails). Thanks for your help finding the right place for the check 😄

PS E:\go\src\github.com\docker\docker> dockerd
time="2016-08-19T09:50:46.166452500-07:00" level=info msg="Windows default isolation mode: hyperv"
time="2016-08-19T09:50:46.167451200-07:00" level=info msg="[graphdriver] using prior storage driver: windowsfilter"
time="2016-08-19T09:50:46.178451900-07:00" level=warning msg="Skipping missing chain ID sha256:b1a580cdd04afd8533e1110f75af6bdd4eb9df569f03b3a967b10559f14d5aeb. ID sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e. Layer not found"
time="2016-08-19T09:50:46.178451900-07:00" level=warning msg="Skipping missing chain ID sha256:e299fe1dc3af0298516b521f040ff10044d689be345f506d05e080f5050e3936. ID sha256:be1da82470363009d64cb54417a97a38dfcca448116a5ccd9cd5d4b3f78c663c. Layer not found"
time="2016-08-19T09:50:46.189452300-07:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2016-08-19T09:50:46.189452300-07:00" level=info msg="Loading containers: start."
.......
time="2016-08-19T09:50:46.324936500-07:00" level=info msg="Loading containers: done."
time="2016-08-19T09:50:46.324936500-07:00" level=info msg="Daemon has completed initialization"
time="2016-08-19T09:50:46.324936500-07:00" level=info msg="Docker daemon" commit=adeddfd-jhoward-JHOWARD-Z420-Dynamic graphdriver=windowsfilter version=1.13.0-dev
time="2016-08-19T09:50:46.329934200-07:00" level=info msg="API listen on //./pipe/docker_engine"

@stevvooe
Copy link
Contributor

Ok, code looks good now.

But, why are these layers missing? (should have asked that earlier)

@lowenna
Copy link
Member Author

lowenna commented Aug 20, 2016

@stevvooe a great question but one I don't know the answer to. I was planning on digging in further as to what caused that once I'm back from vacation (29th). @PatrickLang please keep that machine around if you can in that state, but if not, at least zip up the programdata\docker directory to keep safe so I can do some more analysis on it.

@PatrickLang
Copy link

I'll keep that machine as is. I left it in the office and am out most of next week at ContainerCon/LinuxCon

Sent from a tiny keyboard

On Aug 19, 2016, at 6:20 PM, John Howard [email protected] wrote:

@stevvooe a great question but one I don't know the answer too. I was planning on digging in further as to what caused that once I'm back from vacation (29th). @PatrickLang please keep that machine around if you can in that state, but if not, at least zip up the programdata\docker directory to keep safe so I can do some more analysis on it.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@stevvooe
Copy link
Contributor

@PatrickLang I am at (Container/Linux) * Con. Come find me and let's see if we can figure this out. Skipping bad/missing layers worries me.

@PatrickLang
Copy link

We talked and agree we need to get an error message that better points to the source of the problem. I think I know how I corrupted it so I'll give that a try.

@stevvooe
Copy link
Contributor

@PatrickLang Were you able to reproduce the corruption/invalid data?

@lowenna
Copy link
Member Author

lowenna commented Aug 29, 2016

I agree that root causing it needs further investigation, but having the daemon not be able to start at all is obviously very undesirable.

@PatrickLang
Copy link

@stevvooe I'm giving it a try today. I was out the last few days.

@stevvooe
Copy link
Contributor

@jhowardmsft Agreed. We want to have a good error message here for what action the user should take. My worry is that running after encountering this could lead to some sort of exploit.

@PatrickLang
Copy link

@stevvooe I backed up this machine and managed to clone it to a VM. @jhowardmsft should have everything he needs to work on any future changes.

The original steps to reproduce the problem aren't possible because I had installed an expired Windows Insider build, then upgraded from there. The old build won't boot or install any longer because the cert has expired.

Can we move forward with the fix to start the daemon correctly?

@lowenna
Copy link
Member Author

lowenna commented Sep 3, 2016

Debugging some more. But this code is black magic to me. Had to keep using the laptop in the end - restoring the VM didn't work. The high-level cause seems down to a different chainID being returned between 1.12.1 (which starts up fine) and 1.13 (which fails) at this line: https://github.com/docker/docker/blob/master/image/store.go#L71. With some extra debugging in the restore() and walk() functions, here's the 1.12.1 log snipped:

...
Walk sub function with ID sha256:5880b90e62ad24b21f7ff6765317dccf88af64903340a5f7a604b3a4f5729ee6
chainID= sha256:16f31274ec03c19d9b85b7fdab8f54894b855fc133311d3e9e48a9c424fcf0a8
digest.Digest(id)= sha256:5880b90e62ad24b21f7ff6765317dccf88af64903340a5f7a604b3a4f5729ee6
In loop: dgst sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e
Calling f(ID(dgst))...
Walk sub function with ID sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e
chainID= sha256:3821c2c9a0923d0a84b5f1ebc1aa8b92e15a2144a73b78388a2aa4b24543d070
digest.Digest(id)= sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e
In loop: dgst sha256:be1da82470363009d64cb54417a97a38dfcca448116a5ccd9cd5d4b3f78c663c
Calling f(ID(dgst))...
Walk sub function with ID sha256:be1da82470363009d64cb54417a97a38dfcca448116a5ccd9cd5d4b3f78c663c
chainID= sha256:b98423228d7512cb7264e72add9f3e8d37bd778422b4b0d7fb600ff8e6954561
digest.Digest(id)= sha256:be1da82470363009d64cb54417a97a38dfcca448116a5ccd9cd5d4b3f78c663c
In loop: dgst sha256:fd3a87db730fdbbc9f591162064d0d540714ad77987a1fb42f9eaa5482879521
Calling f(ID(dgst))...
Walk sub function with ID sha256:fd3a87db730fdbbc9f591162064d0d540714ad77987a1fb42f9eaa5482879521
chainID= sha256:ca1c92dd593a3082b2843f538539ad252a64f9be142aa2ac169feacaee1d7291
digest.Digest(id)= sha256:fd3a87db730fdbbc9f591162064d0d540714ad77987a1fb42f9eaa5482879521
Walk out of loop
Second pass
End of restore()
time="2016-09-02T17:12:08.395123300-07:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
...

But on 1.13-dev (master), a different chainID is returned for the walk function with ID

...
Walk sub function with ID sha256:5880b90e62ad24b21f7ff6765317dccf88af64903340a5f7a604b3a4f5729ee6
chainID= sha256:16f31274ec03c19d9b85b7fdab8f54894b855fc133311d3e9e48a9c424fcf0a8
digest.Digest(id)= sha256:5880b90e62ad24b21f7ff6765317dccf88af64903340a5f7a604b3a4f5729ee6
In loop: dgst sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e
Calling f(ID(dgst))...
Walk sub function with ID sha256:98b59c44f72d45b576975c60f00ac7e652aed319656a1ef5f255e68dc77a004e
chainID= sha256:b1a580cdd04afd8533e1110f75af6bdd4eb9df569f03b3a967b10559f14d5aeb
Failed to get chain ID layer does not exist
goroutine 1 [running]:
runtime/debug.Stack(0x0, 0x0, 0x0)
        C:/Go/src/runtime/debug/stack.go:24 +0x87
runtime/debug.PrintStack()
        C:/Go/src/runtime/debug/stack.go:16 +0x1f
github.com/docker/docker/image.(*fs).Walk(0xc082396fc0, 0xc0823bc2e0, 0x0, 0x0)
        e:/go/src/github.com/docker/docker/image/fs.go:91 +0xdda
github.com/docker/docker/image.(*store).restore(0xc082379880, 0x0, 0x0)
        e:/go/src/github.com/docker/docker/image/store.go:64 +0xb1
...

@lowenna
Copy link
Member Author

lowenna commented Sep 3, 2016

Of course, it's different due to the base layer changes. The 1.12 ChainID() implementation is completely different on Windows to the unified 1.13 version.. https://github.com/docker/docker/blob/1.12.x/image/rootfs_windows.go#L35. @swernli - need to talk on Tuesday :)

@lowenna
Copy link
Member Author

lowenna commented Sep 6, 2016

@stevvooe @swernli I might have a slightly better fix after further playing with the code this morning. As I say though, this really isn't an area of code I know particularly well. It could potentially be improved by having a _windows.go file, but in many ways I like keeping the unified file. This does also solve the daemon not starting issue on @PatrickLang's laptop. If only the function also returned a well typed error, but that would be a lot more invasive a fix...

Commit: microsoft@710b456

@aaronlehmann
Copy link

The linked commit looks sensible to me. The old image format shouldn't prevent the daemon from starting. It might be a good idea to log the fact that an image was ignored, though.

@swernli
Copy link
Contributor

swernli commented Sep 6, 2016

Yup, microsoft/docker@710b456 looks good to me too.

@stevvooe
Copy link
Contributor

stevvooe commented Sep 6, 2016

microsoft/docker@710b456 looks fine.

@lowenna
Copy link
Member Author

lowenna commented Sep 7, 2016

PR updated with newer fix as per above. Verified daemon starts on Patricks laptop.

@stevvooe
Copy link
Contributor

stevvooe commented Sep 7, 2016

LGTM

1 similar comment
@tonistiigi
Copy link
Member

LGTM

@tonistiigi tonistiigi merged commit b890d47 into moby:master Sep 7, 2016
@lowenna lowenna deleted the jjh-missingdigest branch September 7, 2016 23:41
@thaJeztah thaJeztah added this to the 1.13.0 milestone Sep 26, 2016
@thaJeztah thaJeztah changed the title Don't fail daemon start on missing layer Don't fail daemon start on missing layer (skip layers+base images) Jul 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants