Content addressability #17924

Merged
calavera merged 7 commits into moby:master from aaronlehmann:content-addressability on Nov 24, 2015

Conversation

@tonistiigi
Member

This PR changes the Docker engine to use content addressable storage for images and layers. This means that image IDs are no longer arbitrarily assigned - now they are computed based on the filesystem contents and the image configuration. Images can securely share their underlying layers without duplicating data on disk.
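
A minimal sketch of what "computed based on the filesystem contents and the image configuration" can look like, assuming the ID is the SHA-256 digest of the serialized configuration and that the configuration lists the layer digests. The `imageConfig` type, its fields, and the placeholder digests are hypothetical and are not the engine's actual types.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// imageConfig is a hypothetical, trimmed-down image configuration.
// DiffIDs stands in for the references to the underlying layers.
type imageConfig struct {
	Cmd     []string `json:"cmd"`
	Env     []string `json:"env"`
	DiffIDs []string `json:"diff_ids"`
}

// imageID returns "sha256:<hex>" computed over the serialized configuration,
// so the same configuration and layers always yield the same ID.
func imageID(cfg imageConfig) (string, error) {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return "sha256:" + hex.EncodeToString(sum[:]), nil
}

func main() {
	layers := []string{"sha256:aaaa", "sha256:bbbb"} // placeholder layer digests
	a, _ := imageID(imageConfig{Cmd: []string{"/foo"}, DiffIDs: layers})
	b, _ := imageID(imageConfig{Cmd: []string{"/bar"}, DiffIDs: layers})
	// Same layers, different configuration: the IDs differ, the layers are shared.
	fmt.Println(a)
	fmt.Println(b)
}
```

Because the layer digests are part of the hashed bytes, changing either the filesystem contents or the runtime configuration changes the ID.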

This work involves a big refactoring of image-related code in the engine. The graph package is entirely removed, and is replaced by a few new packages:

  • layer: Layer store interface and implementation. This manages filesystem layers without being aware of image configurations. The layer store maintains reference counts for layers and removes unreferenced layers.
  • image: Image store interface and implementation. This manages image configurations. Image configurations include runtime configuration and reference the underlying filesystem data.
  • tag: The tag functionality from the graph package has been separated out and moved here. Tags now use types from the distribution/reference package instead of plain strings.
  • distribution: Push and pull code that originally was part of the graph package has been separated out into its own package.
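
The reference counting described for the layer package can be sketched as follows. This is an illustrative in-memory model only: `layerStore`, `Acquire`, and `Release` are hypothetical names, not the interface introduced by this PR.

```go
package main

import "fmt"

// layerStore is a hypothetical in-memory stand-in for a layer store.
// It knows nothing about image configurations, only about layer digests.
type layerStore struct {
	refs map[string]int // layer digest -> reference count
}

func newLayerStore() *layerStore {
	return &layerStore{refs: make(map[string]int)}
}

// Acquire registers one more reference to the layer.
func (s *layerStore) Acquire(digest string) {
	s.refs[digest]++
}

// Release drops one reference and reports whether the layer was removed
// because nothing references it any more.
func (s *layerStore) Release(digest string) (removed bool, err error) {
	n, ok := s.refs[digest]
	if !ok {
		return false, fmt.Errorf("unknown layer %s", digest)
	}
	if n > 1 {
		s.refs[digest] = n - 1
		return false, nil
	}
	delete(s.refs, digest)
	return true, nil
}

func main() {
	s := newLayerStore()
	s.Acquire("sha256:base") // referenced by image A
	s.Acquire("sha256:base") // referenced by image B

	removed, _ := s.Release("sha256:base") // image A deleted
	fmt.Println("removed after first release:", removed)
	removed, _ = s.Release("sha256:base") // image B deleted
	fmt.Println("removed after second release:", removed)
}
```

Keeping the counts in the layer store is what lets two images share a layer on disk without either image needing to know about the other.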

The first time a version of the engine with these commits is started, it will migrate old graph metadata to the new format. This involves calculating content hashes for the existing data, but it does not move underlying graphdriver filesystem data. It doesn’t remove old graph metadata, so the migration process is not destructive.

The new data model does not have a one-to-one relationship between images and layers. A single image can have many layers. Existing versions of Docker create an image for each layer, and use the parent chain to link them together. That means that pulling a specific image requires pulling all the artifacts from the original build process of that image. With this PR, when an image is pulled from a registry, only a single image is created, to match the new data model. The history is preserved through a list of commands, dates, etc.

Summary of UI Changes:

  • Pull/push do not transfer the entire parent chain.
  • Full-length image IDs have a sha256: prefix. This prefix is hidden for truncated image IDs for convenience.
  • Images don't have the concept of VirtualSize (because there’s no special meaning for a top layer anymore).
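
The ID display rule in the list above could be sketched like this. `displayID` is a hypothetical helper, and the 12-character short length is an assumption taken from the truncated IDs shown in `docker history` output.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	digestPrefix = "sha256:"
	shortLen     = 12 // assumed short-ID length
)

// displayID returns the full ID unchanged, or, when truncate is set,
// the ID with the "sha256:" prefix hidden and cut to shortLen characters.
func displayID(id string, truncate bool) string {
	if !truncate {
		return id
	}
	short := strings.TrimPrefix(id, digestPrefix)
	if len(short) > shortLen {
		short = short[:shortLen]
	}
	return short
}

func main() {
	// Dummy 64-character hex string, not a real image ID.
	id := digestPrefix + strings.Repeat("0123456789abcdef", 4)
	fmt.Println(displayID(id, false))
	fmt.Println(displayID(id, true))
}
```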

Future work:

Unit test coverage for the new packages:

ok      github.com/docker/docker/image  0.051s  coverage: 85.8% of statements
ok      github.com/docker/docker/layer  0.223s  coverage: 76.0% of statements
ok      github.com/docker/docker/migrate/v1 0.033s  coverage: 71.1% of statements
ok      github.com/docker/docker/tag    0.017s  coverage: 86.1% of statements

For the full design document, see https://gist.github.com/aaronlehmann/b42a2eaf633fc949f93b

@tiborvass tiborvass added this to the 1.10 milestone Nov 12, 2015
@aaronlehmann aaronlehmann force-pushed the content-addressability branch 3 times, most recently from 8b59d05 to 1d793b1 Compare November 12, 2015 01:55
@lowenna
Member

lowenna commented Nov 12, 2015

@swernli - Stefan can you review from the Windows side please?

@aaronlehmann aaronlehmann force-pushed the content-addressability branch 4 times, most recently from 49ec06e to 44e6989 Compare November 12, 2015 03:19
@thaJeztah
Member

The first time a version of the engine with these commits is started, it will migrate old graph metadata to the new format. This involves calculating content hashes for the existing data, but it does not move underlying graphdriver filesystem data. It doesn’t remove old graph metadata, so the migration process is not destructive.

Curious; will repeated upgrades / downgrades work? (e.g., Used 1.10 for testing, then continue on 1.9 and run 1.10 again)

@aaronlehmann

Curious; will repeated upgrades / downgrades work? (e.g., Used 1.10 for testing, then continue on 1.9 and run 1.10 again)

New images built or pulled in 1.10 wouldn't be visible to 1.9. But the original images would still be there, unless you manually deleted some of them.

daemon/daemon.go Outdated
Member

should this be versioned? or can't it anymore?

Member Author

VirtualSize as a concept has been removed, since all images freely share all of their layers and no layer is unique to an image. The Size value is now what VirtualSize used to be.

I changed the PR so that VirtualSize is no longer cleared but reports the same data as Size. That should provide correct data to older clients. New clients no longer use this field. Since the field isn't useful for new clients, we can work out a deprecation path through API versions, but I think that can wait until after this PR is merged.

Member

cool thanks @tonistiigi just making sure we won't forget this

@swernli
Contributor

swernli commented Nov 12, 2015

Reviewing for Windows...

@aaronlehmann aaronlehmann force-pushed the content-addressability branch from 59afa95 to 9c6133a Compare November 12, 2015 20:08
@thaJeztah
Member

ping @docker/maintainers please review and test this one. It's a huge change, but really exciting stuff.

also:
please try to avoid merging large PRs while this PR isn't merged yet to avoid rebase hell

@jessfraz
Contributor

\o/

@thaJeztah
Member

A difference I see with this one (probably expected, but just to verify)

Dockerfile

FROM ubuntu:14.04
ENV foo=bar
RUN echo hello > /foo

docker build -t foo .

Docker 1.9:

docker history foobar
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
408581b4301b        7 seconds ago       /bin/sh -c echo hello > /foo                    6 B
012645df8321        7 seconds ago       /bin/sh -c #(nop) ENV foo=bar                   0 B
e9ae3c220b23        2 days ago          /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B
a6785352b25c        2 days ago          /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/   1.895 kB
0998bf8fb9e9        2 days ago          /bin/sh -c echo '#!/bin/sh' > /usr/sbin/polic   194.5 kB
0a85502c06c9        2 days ago          /bin/sh -c #(nop) ADD file:531ac3e55db4293b8f   187.7 MB

With this PR:

docker history foobar
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
a12448c314a7        6 minutes ago       /bin/sh -c echo hello > /foo                    2.048 kB
e39f86d96664        6 minutes ago       /bin/sh -c #(nop) ENV foo=bar                   0 B
4b1e42b414f6        2 days ago          /bin/sh -c #(nop) CMD ["/bin/bash"]             1.024 kB
<missing>           2 days ago          /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/   4.608 kB
<missing>           2 days ago          /bin/sh -c echo '#!/bin/sh' > /usr/sbin/polic   208.9 kB
<missing>           2 days ago          /bin/sh -c #(nop) ADD file:531ac3e55db4293b8f   196.8 MB

Differences:

  • sizes reported are different
  • images from the parent image are reported as <missing> (I expect that to be by design, but just checking)

@tonistiigi
Member Author

@thaJeztah The sizes are currently from tars so they have the added padding. We are exploring ways to revert this if possible. <missing> is shown if the parent image is not locally available. This is because you probably did a fresh pull and got a flat image without parents for ubuntu.
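
The padding effect can be reproduced with Go's standard `archive/tar` package. This is a sketch under the assumption that the reported size is simply the byte length of the layer tar; the comment above only says the sizes "are currently from tars". `tarSize` is a hypothetical helper.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"time"
)

// tarSize returns the size in bytes of a tar archive holding one regular
// file with the given content, or of an empty archive if content is nil.
func tarSize(content []byte) (int, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	if content != nil {
		hdr := &tar.Header{
			Name:     "foo",
			Mode:     0644,
			Size:     int64(len(content)),
			ModTime:  time.Unix(0, 0),
			Typeflag: tar.TypeReg,
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return 0, err
		}
		if _, err := tw.Write(content); err != nil {
			return 0, err
		}
	}
	// Close pads the last file to a 512-byte boundary and writes the
	// end-of-archive marker (two 512-byte zero blocks).
	if err := tw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	n, _ := tarSize([]byte("hello\n")) // the 6 bytes written by `echo hello`
	fmt.Println(n)                     // 2048: 512 header + 512 padded data + 1024 end marker

	empty, _ := tarSize(nil)
	fmt.Println(empty) // 1024: only the end-of-archive marker
}
```

A 6-byte file therefore occupies 2048 bytes as a tar stream, which is consistent with the 6 B versus 2.048 kB difference in the two history listings.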

@thaJeztah
Member

@tonistiigi correct, it was a fresh pull

@swernli
Contributor

swernli commented Nov 12, 2015

Currently, these changes break the use of alternate ID files in layers necessary for Windows containers to work. I'm looking for a fix that can be applied to these changes to unblock us.

@moxiegirl
Contributor

Not in doc review but this could have documentation impact that is pretty big depending on the implementation. So, engineering hours could be required in terms of reviewing at the least. I'd like to include this information early in the PR lifecycle.

@metayd
Contributor

metayd commented Feb 17, 2016

The packages I was talking about are called image and layer. They are in the top-level directory of this repository.
@aaronlehmann Thanks.

@metayd
Contributor

metayd commented Feb 20, 2016

@aaronlehmann @tonistiigi Hello, are there any tools I can use to calculate the content hash of a tar file?
I have tried tar -xOf layer.tar | sha256sum, but the hash I got differs from the one calculated by the Docker daemon.
Or is there anything special in the Go tar-split package?

@stevvooe
Contributor

@dbdd4us The content hash of a layer is the hash of the compressed content, as sent to the registry.

@tonistiigi
Member Author

@stevvooe layer hashes in image config are the hashes of uncompressed content.
@dbdd4us I'm not sure what the layer.tar is in your example. Checksum is for all the bytes in tar, there is no extraction step involved. Easiest to get those tars is with docker save. Note that these do not match the image IDs that are checksums of image config + layer checksums. Make sure to also read the design doc in first post.
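
A sketch of the "all the bytes in tar, no extraction" checksum, using only the Go standard library. `streamDigest` and `digestBytes` are hypothetical helper names. For a real layer, the reader passed to `streamDigest` would be the opened uncompressed tar file, for example one taken from `docker save` output.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// streamDigest hashes every byte read from r, exactly as stored.
// Nothing is extracted or decompressed along the way.
func streamDigest(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}

// digestBytes is a convenience wrapper for in-memory data.
func digestBytes(b []byte) string {
	d, _ := streamDigest(bytes.NewReader(b)) // reading from memory cannot fail
	return d
}

func main() {
	// Stand-in bytes; in practice this would be the tar stream itself.
	fmt.Println(digestBytes([]byte("abc")))
}
```

By contrast, `tar -xOf layer.tar | sha256sum` hashes the concatenated contents of the extracted files and skips the tar headers and padding, so it cannot match. The shell equivalent of the sketch is `sha256sum layer.tar`.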

petrosagg added a commit to balena-io-archive/docker that referenced this pull request Mar 29, 2017
When content addressability was introduced in moby#17924, a compatibility
layer for registry v1 pushes was added. When the engine is asked to
push an image to a v1 registry it needs to create v1 IDs for the images.

The strategy so far has been to use the full imageID for the first v1
layer and the ChainID for all other layers, effectively creating as many
v1 layers as there are in the image. Only the topmost layer contained
the image configuration and the other layers had a dummy json containing
only a parent reference.

This becomes problematic when the first layer of the image is big.
Consider the following two Dockerfiles:

FROM busybox
RUN create_very_big_file
CMD /foo

FROM busybox
RUN create_very_big_file
CMD /bar

Both of these images will have the exact same layers, with the layer
created by `RUN create_very_big_file` being the topmost one, but their
imageIDs will differ since they have a different CMD and therefore
different image configs.

When pushing to a v1 registry, the `RUN create_very_big_file` layer will
be pushed twice, once with the v1 ID set to foo's imageID and once with
the v1 ID set to bar's imageID. Also, any clients wanting to pull those
images won't realise it's the same layer and will proceed to download it
twice.

This commit solves this problem by separating the layers from the image
configuration information when pushing to a v1 registry. To do this, all
layers of an image are pushed with their ChainIDs and a synthetic top
level layer is created with its contents set to the EmptyLayer, its
config set to the image config, and its v1 ID set to the imageID. This
will have the side-effect of adding one layer.

To prevent new layers being piled on top of each other forever, the code
checks if the topmost layer is already an empty layer and in that case
it uses that for the image configuration.

Signed-off-by: Petros Angelatos <[email protected]>
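
The push strategy described in the commit message above can be sketched roughly as follows. This is an illustration of the idea only, not the commit's code: `v1Layer`, `planV1Push`, and the placeholder IDs are hypothetical, and chain IDs are assumed to be ordered from bottom to top.

```go
package main

import "fmt"

// v1Layer is a hypothetical description of one layer as pushed to a v1 registry.
type v1Layer struct {
	ID        string // v1 ID used for the push
	HasConfig bool   // true if this layer carries the full image config
	Synthetic bool   // true if the layer was added only to hold the config
}

// planV1Push pushes every content layer under its chain ID and puts the
// image config on a top-level empty layer whose v1 ID is the image ID.
// If the topmost layer is already empty, that layer is reused for the
// config instead of stacking another empty layer on top.
func planV1Push(imageID string, chainIDs []string, topIsEmpty bool) []v1Layer {
	n := len(chainIDs)
	if topIsEmpty && n > 0 {
		n-- // the existing empty top layer takes the config slot
	}
	plan := make([]v1Layer, 0, n+1)
	for _, id := range chainIDs[:n] {
		plan = append(plan, v1Layer{ID: id})
	}
	plan = append(plan, v1Layer{
		ID:        imageID,
		HasConfig: true,
		Synthetic: n == len(chainIDs),
	})
	return plan
}

func main() {
	chain := []string{"chain-busybox", "chain-bigfile"} // placeholder chain IDs
	foo := planV1Push("image-foo", chain, false)
	bar := planV1Push("image-bar", chain, false)
	for i := range foo {
		fmt.Printf("%-14s %-14s shared=%v\n", foo[i].ID, bar[i].ID, foo[i].ID == bar[i].ID)
	}
}
```

With this shape, the two example images differ only in the small config-bearing top layer, so the big layer keeps a single v1 ID and is transferred once.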
petrosagg added a commit to balena-io-archive/docker that referenced this pull request May 12, 2017
mountID := name
if runtime.GOOS != "windows" {
	// windows has issues if container ID doesn't match mount ID
	mountID = stringid.GenerateRandomID()
}

Hi, I'm reading through the moby source code. When I got to this part, I wondered whether there is any reason moby does not always use name, regardless of whether the platform is Windows. Is there some consideration that isn't mentioned in the comment?

AkihiroSuda added a commit to AkihiroSuda/docker that referenced this pull request Nov 24, 2018
The v1.10 layout and the migrator were added in 2015 via moby#17924.

Although the migrator is not marked as "deprecated" explicitly in
cli/docs/deprecated.md, I suppose people should have already migrated
from pre-v1.10 and they no longer need the migrator, because pre-v1.10
versions do not support schema2 images (and these versions no longer
receive security updates).

Signed-off-by: Akihiro Suda <[email protected]>
adi-dhulipala pushed a commit to adi-dhulipala/docker that referenced this pull request Apr 11, 2019