Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design: what checksum to use for snapshot content #38

Open
tonistiigi opened this issue Jun 23, 2017 · 5 comments
Open

design: what checksum to use for snapshot content #38

tonistiigi opened this issue Jun 23, 2017 · 5 comments

Comments

@tonistiigi
Copy link
Member

docker build uses a hash that contains some selection of tar headers together with the file data for detecting data changes. For directories, the hashes of subfiles are used.

In some cases, it is not correct, for example, the hash would change if a filename changes while it has no effect on the destination where it would be copied.

A bigger problem is that it does not support hardlinks. As the hash is over a tar header the second link would only hash the parent path and be completely wrong. Example of this is in https://gist.github.com/tonistiigi/775cb15d3918958020bdd0165f776005

One other solution would be to use https://github.com/containerd/continuity . A problem with that is that it is not a tree hash, so getting a hash of a directory means recursively walking it. mtree has similar problems.

We could make our hash version if we pick the right headers and solve a way to normalize hardlinks. As these checksums are local to builder, we could use a faster crypto-hash than sha256 as well.

At first, we should probably go with the current hash as everything needed for that is mostly already implemented in https://github.com/moby/moby/blob/master/builder/remotecontext/tarsum.go . But if we want to fix these above issues, now would be a time to do it.

@dmcgowan

@philtay
Copy link

philtay commented Jun 23, 2017

Not sure if this will help, but have a look at BLAKE2. It's fast, strong and natively supports tree hashing.

@AkihiroSuda
Copy link
Member

Hashing all the metadata is not easy (although I'd like to see continuity to evolve to handle that).

So I think tarsum is OK atm, but should allow specifying other algorithms by design.

@tonistiigi
Copy link
Member Author

@AkihiroSuda So would you ignore the hardlink issue or fix it while still using the tar headers?

@AkihiroSuda
Copy link
Member

Oh, I was misunderstanding hardlink just causes false cache miss, but it seems actually causing false cache hit.

This hardlink issue seems to need to be fixed.

Also, can we open GitHub issue about this at moby/moby repo, so that we can collect information from users about how this issue affects real usecase.

@tonistiigi
Copy link
Member Author

tonistiigi commented Nov 1, 2017

I'll leave this open as we can make changes if needed but contenthash package added in #90 uses same headers are tarsum but fixes the filename/hardlink/symlink issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants