-
Notifications
You must be signed in to change notification settings - Fork 800
config: provide more complete explanation of ChainID #586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,19 +25,61 @@ Changing it means creating a new derived image, instead of changing the existing | |
|
|
||
| ### Layer DiffID | ||
|
|
||
| A layer DiffID is a SHA256 digest over the layer's uncompressed tar archive and serialized in the descriptor digest format, e.g., `sha256:a9561eb1b190625c9adb5a9513e72c4dedafc1cb2d4c5236c9a6957ec7dfd5a9`. | ||
| A layer DiffID is the digest over the layer's uncompressed tar archive and serialized in the descriptor digest format, e.g., `sha256:a9561eb1b190625c9adb5a9513e72c4dedafc1cb2d4c5236c9a6957ec7dfd5a9`. | ||
| Layers must be packed and unpacked reproducibly to avoid changing the layer DiffID, for example by using tar-split to save the tar headers. | ||
|
|
||
| NOTE: Do not confuse DiffIDs with [layer digests](manifest.md#image-manifest-property-descriptions), often referenced in the manifest, which are digests over compressed or uncompressed content. | ||
|
|
||
| ### Layer ChainID | ||
|
|
||
| For convenience, it is sometimes useful to refer to a stack of layers with a single identifier. | ||
| This is called a `ChainID`. | ||
| For a single layer (or the layer at the bottom of a stack), the | ||
| `ChainID` is equal to the layer's `DiffID`. | ||
| Otherwise the `ChainID` is given by the formula: | ||
| `ChainID(layerN) = SHA256hex(ChainID(layerN-1) + " " + DiffID(layerN))`. | ||
| While a layer's `DiffID` identifies a single changeset, the `ChainID` identifies the subsequent application of those changesets. | ||
| This ensures that we have handles referring to both the layer itself, as well as the result of the application of a series of changesets. | ||
| Use in combination with `rootfs.diff_ids` while applying layers to a root filesystem to uniquely and safely identify the result. | ||
|
|
||
| #### Definition | ||
|
|
||
| The `ChainID` of an applied set of layers is defined with the following recursion: | ||
|
|
||
| ``` | ||
| ChainID(L₀) = DiffID(L₀) | ||
| ChainID(L₀|...|Lₙ₋₁|Lₙ) = Digest(ChainID(L₀|...|Lₙ₋₁) + " " + DiffID(Lₙ)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Otherwise the ChainID could be the result of different algorithm choices at each stage, and the final ChainID digest only identifies the algorithm for the final digest.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, it is well-defined.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Here's an explicit example. Consider the following DiffID array: Your PR makes it very clear that Now say the chain ID implementation decides to take a Good. Say they decided to use Also good. Clearly not the same as the sha256 version, but the algorithm identifier makes that clear. Moving on to And the fellow who used That's two possible a. Require a particular algorithm identifier for all chainID digests (this is independent of the algorithm identifier used for the diffIDs themselves) or (a) could be phrased as: and (b) could be phrased as: where the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wking this is defining the mathematical formula for a digest and reasoning behind it. Defining the algorithm or making requirements about algorithms matching is unrelated to the definition of a chain ID. This is a perfectly valid chain id as the properties are reserved. The processing creating the chain ID is also responsible for providing the diff IDs, the algorithms of those diff digests are completely irrelevant to the chain id computation and the algorithm used for the end chain ID does not need to match the diff IDs. The chain ID algorithm can be set by the implementation of the chain ID and as with all cases where a digest is used, no algorithm should be preferred by specification.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agreed.
I agree that defining SHA-256 is unrelated, but I don't agree that algorithm selection is unrelated to the chainID definition.
I'm unclear on what you mean by “the properties are reserved”.
This is not necessarily true. DiffIDs are part of the distributed config, so they are created by the config author. ChainIDs may be created by the config author (in which case your statement is true), but they may also be created by the config consumer (e.g. for identifying directories with unpacked/applied stacks of layers) and in that case your statement is false.
Agreed.
That sounds like an argument for the Leaving the algorithm choice unrestricted (and thus allowing algorithm swaps), means that there is no guess-free way of validating
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wking You are assigning uncertainty where there is none. I please ask that you take time to understand the subject before commenting with authority. If there are questions you have, please don't be afraid to ask, but please pose them as questions. I also believe that you are misunderstanding the notation. First off, the expression Either way, the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Then why bother defining it at all? Can't we just point out that using a hash of the final layer is (obviously?) not unique to one particular array of layers and is a property shared by all arrays ending with that layer. I don't see the point of spending ~50 lines defining and explaining |
||
| ``` | ||
|
|
||
| For this, we define the binary `|` operation to be the result of applying the right operand to the left operand. | ||
| For example, given base layer `A` and a changeset `B`, we refer to the result of applying `B` to `A` as `A|B`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can avoid defining
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
But that is not the definition of a
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Not as your PR stands with 0fa3cc5, but with a string-based ChainID definition, it's easy enough to couple back in your choice of On the other hand, it's clearly impossible to decouple the Your implementation seems to agree by accepting |
||
|
|
||
| Above, we define the `ChainID` for a single layer (`L₀`) as equivalent to the `DiffID` for that layer. | ||
| Otherwise, the `ChainID` for `L₀|...|Lₙ₋₁|Lₙ` is defined as recursion `Digest(ChainID(L₀|...|Lₙ₋₁) + " " + DiffID(Lₙ))`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This paragraph doesn't add anything to the earlier definition that I can see. There's already enough going on here that I don't think repeating that definition here adds anything, so I'd like to drop this paragraph.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is called adding context. For those unfamiliar with recursion, or who practice literate reading, this is very helpful. It helps to assign an english meaning to an otherwise obtuse syntax.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So keep this version and drop the earlier version? The earlier version is not much more compact than this paragraph, so if it is obtuse in any way I don't see the point of including it.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Different people understand things it different ways. |
||
|
|
||
| #### Explanation | ||
|
|
||
| Let's say we have layers A, B, C, ordered from bottom to top, where A is the base and C is the top. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: Switching from
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a specific example verses a generic case.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. “A, B, C” is not a specific example. A specific example would be “the array of strings |
||
| Defining `|` as a binary application operator, the root filesystem may be `A|B|C`. | ||
| While it is implied that `C` is only useful when applied to `A|B`, the identifier `C` is insufficient to identify this result, as we'd have the equality `C = A|B|C`, which isn't true. | ||
|
|
||
| The main issue is when we have two definitions of `C`, `C = C` and `C = A|B|C`. If this is true (with some handwaving), `C = x|C` where `x = any application` must be true. | ||
| This means that if an attacker can define `x`, relying on `C` provides no guarantee that the layers were applied in any order. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These two paragraphs seem like they could be clarified. How about something like:
Then follow up with your next paragraph reiterating that the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You keep trying to model this as an array of strings. It is not.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hashing anything is an operation on byte strings. If you want to keep the |
||
|
|
||
| The `ChainID` addresses this problem by being defined as a compound hash. | ||
| __We differentiate the changeset `C`, from the order dependent application `A|B|C` by saying that the resulting rootfs is identified by ChainID(A|B|C), which can be calculated by `ImageConfig.rootfs`.__ | ||
|
|
||
| Let's expand the definition of `ChainID(A|B|C)` to explore its internal structure: | ||
|
|
||
| ``` | ||
| ChainID(A) = DiffID(A) | ||
| ChainID(A|B) = Digest(ChainID(A) + " " + DiffID(B)) | ||
| ChainID(A|B|C) = Digest(ChainID(A|B) + " " + DiffID(C)) | ||
| ``` | ||
|
|
||
| We can replace the each definition and reduce to a single equality: | ||
|
|
||
| ``` | ||
| ChainID(A|B|C) = Digest(Digest(DiffID(A) + " " + DiffID(B)) + " " + DiffID(C)) | ||
| ``` | ||
|
|
||
| Hopefully, the above is illustrative of the _actual_ contents of the `ChainID`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hopefully? If there is any doubt, can we just use real values. With the array-of-strings formulation, you can use: Which has the additional benefit of being an easy unit test for any
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are being hopeful about understanding of the reader, not the result. It is clear that "hopeful" is the right sentiment, as I know of at least one reader who is struggling here.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Right. That's a sign that we doubt the clarity of the current wording, so we should hunt for wording that is clear enough to make us not doubt the reader's understanding.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wking Everyone else I have shown this to seems to understand.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'd still prefer real values. But if many people understand, and only one does not understand, and you feel like there is no room for clarification to help that remaining person, I think that's enough reason to drop this sentence. Letting them know that you hope they will understand is unlikely to impact their understanding. And while we hope readers understand everything in the spec, I think it's appropriate that we don't burn a line saying “We hope you understand this specification” in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not that I don't feel there is no room for clarification. It's that within a day of submitting this PR, which is not contentious at all, you have made upwards of 30 comments on it. I can't tell if you are trying to slow OCI down for nefarious purposes or genuinely think your helping. |
||
| Most importantly, we can easily see that `ChainID(C) != ChainID(A|B|C)`, otherwise, `ChainID(C) = DiffID(C)`, which is the base case, could not be true. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It mirrors the structure of the definition, the prose description and You need to differentiate between "I don't understand the implications of this equality" from "I don't understand the strength or reasoning behind these implications". The whole point here is that
This is not the L₀ case. This is why we revert the to lettering to make this clear.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In fact, If these aspects aren't clear to you, why are commenting with such authority? It seems like you don't have a basic understand of this problem or its solution, yet you are attempting to comment on it as if you have a full understanding or even expertise on the matter. I don't mind further clarifying this, but it makes it hard to have a conversation when you pose your attempt to understand as contrarian statements.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
To have
This is true for
Because I think these aspects are clear to me?
I think pushing back on “this doesn't make sense to me” is a good way to have a discussion that gets us to agreement. There are alternatives though. For example, I could file a PR with my own take on clarifying
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
|
||
| ### ImageID | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it's lifting the requirement for DiffIDs to be SHA-256. That might be useful, but seems orthogonal enough to the ChainID definition that it should be part of a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not a requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the old wording (“A layer DiffID is a SHA256 digest”) it certainly seems like using a
sha256digest is a requirement. Regardless of the original intent, clarifying this point seems orthogonal to documenting the chain ID. But shrug, I'm not a maintainer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHA256 was just inherited from the original description of Docker's use. Best to not mention specific hashes except where explicitly required to avoid this type of uncertainty for future readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're making (or clarifying) that the algorithm isn't fixed, then I'd also recommend using “a digest” instead of “the digest”, because the different algorithm choices allow multiple valid digests for the same content (e.g.
sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7aeandsha512:f7fbba6e0636f890e56fbbf3283e524c6fa3204ae298382d624741d0dc6638326e282c41be5e4254d8820772c5518a2c5a8c0c7f7eda19594a7eb539453e1ed7are both valid digests forfoo).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the digest" refers to "the digest" of the layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A given layer can have multiple digests (see my foo example). That means that an indefinite article is more appropriate than a definite article. Wikipedia gives the following example:
In this case, the lack of an algorithm requirement means that there are several potential digests (one for each possible algorithm identifier), and we are not specifying which of those digests is intended (all are equally legal).