Add documentation for EROFS layer formats#12703
dmcgowan wants to merge 1 commit into containerd:main
Conversation
Includes media types, annotations, and formats for supporting native EROFS. The new types can be used with the existing implementation as well as used as a basis for future implementations with dm-verity support and lazy loading. Signed-off-by: Derek McGowan <[email protected]>
> | Field | Size (Bytes) | Description |
> | :--- | :--- | :--- |
> | Uncompressed Size | 8 | Total size of the uncompressed data |
> | Chunk Size | 4 | Size of each uncompressed chunk (e.g., 4MiB) |
> | Hash Algo | 1 | Algorithm for chunk checksums (0=None, 1=SHA-256) |
> | Hash Size | 1 | Size of the hash in bytes (e.g., 32 for SHA-256) |
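For illustration, here is a minimal Go sketch of how a reader might decode the header fields quoted above. The struct name, little-endian byte order, and tight 14-byte packing are assumptions for the sketch, not something the quoted text specifies.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"io"
)

// chunkTableHeader mirrors the four header fields quoted above.
type chunkTableHeader struct {
	UncompressedSize uint64 // total size of the uncompressed data
	ChunkSize        uint32 // size of each uncompressed chunk, e.g. 4 MiB
	HashAlgo         uint8  // 0=None, 1=SHA-256
	HashSize         uint8  // hash size in bytes, e.g. 32 for SHA-256
}

// readChunkTableHeader decodes the header assuming little-endian,
// tightly packed fields (8 + 4 + 1 + 1 = 14 bytes).
func readChunkTableHeader(r io.Reader) (*chunkTableHeader, error) {
	var buf [14]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return nil, err
	}
	h := &chunkTableHeader{
		UncompressedSize: binary.LittleEndian.Uint64(buf[0:8]),
		ChunkSize:        binary.LittleEndian.Uint32(buf[8:12]),
		HashAlgo:         buf[12],
		HashSize:         buf[13],
	}
	// Sanity check: a SHA-256 checksum is 32 bytes.
	if h.HashAlgo == 1 && h.HashSize != 32 {
		return nil, fmt.Errorf("unexpected hash size %d for SHA-256", h.HashSize)
	}
	return h, nil
}
```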
I don't have major comments. I think Hash Size here can be rephrased as entry size - sizeof(Block Offset) (although I'm not sure about the naming), so that the chunk entry can be extended compatibly and old containerd versions can still support it.
The hash size itself can be derived from the hash algo (otherwise, even if we have the hash size we still cannot utilize it), so recording just the hash size is less useful.
We can remove it and just have a reserved field. I was thinking the algorithm would be sha-2, which could be sha256 or sha512. We could instead define those algorithms explicitly as specific configurations, though.
> We can remove it and just have a reserved field. I was thinking the algorithm would be sha-2, which could be sha256 or sha512. We could instead define those algorithms explicitly as specific configurations, though.
I mainly have the forward-compatibility of this in mind. For example, currently we have 0 = None and 1 = SHA-256, but a future containerd may add 2 = SHA-512: since 2 is unknown to older containerds, those versions cannot even parse the block offset correctly (the chunk entry size cannot be derived from the unknown hash algorithm 2). That is why I think a chunk entry size seems useful in some cases. (Or just use separate chunk arrays for chunk indexing and chunk hashing.)
And could we just consider supporting sha512 in this first version (since it will be used in the foreseeable future)?
> *(Excerpt of the blob layout diagram: a region boundary marked "<--- Offset via Annotation".)*
Can we move this part prior to the Chunk Mapping Table?
I think the raw EROFS image and the DM-Verity Hash Data should be treated as one whole component, and the chunk mapping table can be regarded as out-of-band (OOB) data, for lazy pulling for example.
Related to my Q above about whether to wrap dm-verity data in a skippable frame--
I think we either think of this as:
A) a fully custom format, that includes some zstd frames, or
B) a zstd stream that includes some out of band data
If (A) then I think this comment about viewing erofs image + dm-verity data as a whole component makes more sense to me, but if (B) then the erofs-image is the core data, and both the dm-verity data and the chunk-mapping table are out-of-band data.
I think the ordering doesn't really matter because it just changes the parsing, but if we're viewing it as (B) where the chunk mapping table & the dm-verity data are both extra attached data, then putting the optional dm-verity hash data last makes it a little easier to reason about the blob. If it goes last, then you don't have to reason about whether the region following the erofs image is dm-verity or the chunk mapping table-- it's always the chunk mapping table.
Not a huge deal either way though since we can find both pieces of data via the offset annotations
I agree the ordering does not matter and both should be out-of-band data. Since the dm-verity data is just raw data, using the annotations and simply ignoring it would be the only option as-is. We could always prefix it with a magic number when wrapped in a skippable frame, although it would still not be useful data without the other dm-verity metadata. The reason we don't have a header for dm-verity is that it would just be another value whose hash would need to be checked, so we might as well rely on the annotations for the offset and dm-verity parameters, since the annotations are part of the already hash-checked manifest.
This is mostly a proposal to start trying something out and we can make any and all changes that make sense after further experimentation.
> | Annotation | Description |
> | :--- | :--- |
> | `dev.containerd.erofs.dmverity.root_digest` | **Required for Verity:** The root hash of the DM-Verity tree formatted as an OCI digest. |
> | `dev.containerd.erofs.dmverity.offset` | **Required for Random Access with Verity:** Byte offset where the DM-Verity data begins. If not present, it can be recalculated and must match the root digest if provided. |
> | `dev.containerd.erofs.dmverity.block_size` | **Optional:** Block size used for DM-Verity (default: 4096). |
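To make the shape of this concrete, here is a hedged Go example of a layer descriptor carrying these annotations. The annotation keys and media type come from this proposal; the size, digest, and offset values are placeholders.

```go
package main

import (
	"fmt"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

func exampleDescriptor() ocispec.Descriptor {
	return ocispec.Descriptor{
		MediaType: "application/vnd.erofs.layer.v1+zstd",
		Size:      123456789, // placeholder blob size
		Annotations: map[string]string{
			// Root hash of the DM-Verity tree, as an OCI digest (placeholder).
			"dev.containerd.erofs.dmverity.root_digest": "sha256:aaaa...",
			// Byte offset in the blob where the DM-Verity data begins (placeholder).
			"dev.containerd.erofs.dmverity.offset": "104857600",
			// DM-Verity block size; default is 4096.
			"dev.containerd.erofs.dmverity.block_size": "4096",
		},
	}
}

func main() {
	fmt.Printf("%+v\n", exampleDescriptor())
}
```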
The data block size and hash block size can differ, so we need to differentiate between these two values.
Additionally, we should add a nosuperblock annotation: when a superblock is present, we can read the dm-verity parameters directly from it (except for the root hash) rather than setting them in annotations.
Do we also need an annotation for the salt for dm-verity?
A superblock could be a good idea; then we would instead just include a superblock_digest and know to read it if included?
> The blob consists of:
> 1. **Zstd Compressed Frames:** The raw EROFS filesystem image data compressed in standard [Zstd frames](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frames).
> 2. **Chunk Mapping Table:** A custom table stored inside a [Zstd Skippable Frame](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#skippable-frames). This allows standard Zstd decompressors to ignore it, while aware readers can use it to locate specific uncompressed chunks.
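As a rough illustration of item 2 above, the sketch below reads a skippable frame at a known position (e.g., one recorded via an annotation). The magic-number ranges follow the zstd format spec; the helper itself is hypothetical.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
	"io"
)

const (
	zstdFrameMagic     = 0xFD2FB528 // standard zstd frame magic (little-endian)
	skippableMagicLow  = 0x184D2A50 // skippable frames use 0x184D2A50..0x184D2A5F
	skippableMagicHigh = 0x184D2A5F
)

// readSkippableFrame reads one skippable frame from r and returns its payload,
// which for this format would hold the chunk mapping table.
func readSkippableFrame(r io.Reader) ([]byte, error) {
	// 4-byte magic + 4-byte payload size, both little-endian.
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	magic := binary.LittleEndian.Uint32(hdr[0:4])
	if magic < skippableMagicLow || magic > skippableMagicHigh {
		return nil, fmt.Errorf("not a skippable frame: magic %#x", magic)
	}
	size := binary.LittleEndian.Uint32(hdr[4:8])
	payload := make([]byte, size)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}
```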
It is really helpful to have this written up explicitly, big thanks @dmcgowan
I wrote up a small proof-of-concept around this proposal to understand the options better; I have more verbose notes here, but to be concise:
- zstd skippable frames make a lot of sense, especially since they're described in the zstd seekable spec. I was surprised there wasn't more tooling around reading & writing skippable frames, but they still seem like clearly the right choice.
- I am curious for more context about the motivation for the custom chunk mapping table format. It seems to me that unless we really need a custom format, it's preferable to use the standard zstd seek table and avoid the burden of maintaining a custom one.
  The most significant difference I see is that the zstd seekable spec provides checksums of the uncompressed data, while the proposal describes checksums of compressed data.
  I don't see much difference between the two-- I imagine you can fail faster with checksums of compressed data, but that seems very minor. Is there a threat model in decompressing the data to check the checksum? Or, what am I missing here?
Using the zstd seek table also requires some minor modifications to the proposal:
- dropping the version, and
- moving the table to the end of the blob, consistent with @hsiangkao's comment here (a requirement that comes from the zstd seekable format specification).
> The most significant difference I see is that the zstd seekable spec provides checksums of the uncompressed data, while the proposal describes checksums of compressed data.
@anniecherk For this part: if we really need to check something during transport, we should stick to checksums of the compressed data, since untrusted compressed data should be checked in advance (as soon as it arrives locally) to avoid, for example, crafted compressed sequences. The decompressed data can then be checked by dm-verity (or is implicitly covered once the compressed data is checked, since decompression is deterministic).
First, the zstd seekable format is actually a format in the contrib directory, as you may notice; it sounds standardized, but to me it's just a more-or-less ad-hoc format built on the public zstd APIs.
I mentioned it to @dmcgowan weeks ago, but IMHO it has two basic disadvantages (though it depends on how it's used). The first is that the seek table cannot be randomly indexed, because it doesn't record absolute offsets in bytes; it records sizes instead. Unless the seek table is small (e.g. using MiB-scale chunks), I guess @dmcgowan would like a random-index jump table too.
The second is what I mentioned in the previous comment: the hard requirement for transport checksumming should be checksums of the compressed data, rather than of the decompressed data. They can be used to ensure the compressed data is valid (i.e. is what the image maker produced) before decompression.
I think 4MiB seems like a reasonable chunk size. I can see how it is advantageous for the zstd seekable format to have variable-length chunks, but if the compressed file is a block device it doesn't matter. Random access could be nice but could also be an unneeded optimization, and the entire table would already need to be read for digesting anyway. The pointer storage is the same (an 8-byte pointer vs. 4+4-byte sizes); the checksum is the bigger difference (4 bytes vs. 32). Having the cryptographic hash is nice for preserving the merkle tree based on what is distributed over the wire and for allowing a hash check before processing. The biggest issue I can see is using frame size information before cryptographic verification, leading to memory exhaustion attacks, although that would likely be limited with zstd anyway by having limits on the frame size.
The table data either way is relatively small and, I think, more than reasonable to fit in memory. With 40 bytes per 4MiB chunk, that's only 10K per GiB. The dm-verity data would be much larger but is already randomly seekable and memory-efficient.
> The blob consists of:
> 1. **Zstd Compressed Frames:** The raw EROFS filesystem image data compressed in standard [Zstd frames](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frames).
> 2. **Chunk Mapping Table:** A custom table stored inside a [Zstd Skippable Frame](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#skippable-frames). This allows standard Zstd decompressors to ignore it, while aware readers can use it to locate specific uncompressed chunks.
> 3. **DM-Verity Hash Data (Optional):** Appended at the end.
Does it make sense to also wrap the dm-verity data in a skippable zstd frame?
- if we wrap it in a skippable frame, then the whole blob is a valid zstd stream
- this makes it easier for readers that want to eagerly get the whole erofs-image rather than lazily indexing into the chunks
- if we don't, then I think readers that don't explicitly know about this EROFS format will choke
- we could also wrap it in a normal (non-skippable) zstd data frame, but then the dm-verity data would be decompressed as part of the same data as the erofs image, so that seems undesirable
Yes, absolutely. I think it should be uncompressed (there is no point in compressing hashes), and the dm-verity offset would still point to the beginning of the data rather than to the skippable frame.
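A small Go sketch of what that wrapping could look like, following the comment above: the dm-verity data goes into a skippable frame uncompressed, and the offset annotation points at the data itself (8 bytes past the frame start). The specific magic value within the skippable range is an arbitrary placeholder.

```go
package sketch

import (
	"encoding/binary"
	"io"
)

// Placeholder magic within the skippable range 0x184D2A50..0x184D2A5F.
const dmverityFrameMagic = 0x184D2A51

// writeDMVerityFrame appends the uncompressed dm-verity hash data wrapped in
// a skippable frame and returns the offset of the data within the frame.
func writeDMVerityFrame(w io.Writer, verityData []byte) (int64, error) {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], dmverityFrameMagic)
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(verityData)))
	if _, err := w.Write(hdr[:]); err != nil {
		return 0, err
	}
	if _, err := w.Write(verityData); err != nil {
		return 0, err
	}
	// The dmverity.offset annotation would point at the data itself,
	// i.e. 8 bytes past the start of this frame.
	return 8, nil
}
```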
> | Field | Size (Bytes) | Description |
> | :--- | :--- | :--- |
> | Block Offset | 8 | Absolute offset in the blob where this compressed chunk begins |
> | Checksum | N | (Optional) Checksum of the *compressed* block data. Size `N` defined in Header. |
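To illustrate how a reader might consume these entries, here is a hedged Go sketch that parses the entry array and maps an uncompressed offset to its chunk. The little-endian encoding and the helper names are assumptions.

```go
package sketch

import (
	"encoding/binary"
	"fmt"
)

// chunkEntry mirrors one row of the table above.
type chunkEntry struct {
	BlockOffset uint64 // absolute offset of the compressed chunk in the blob
	Checksum    []byte // optional; N bytes, as defined in the header
}

// parseEntries decodes a packed array of entries, each 8+hashSize bytes.
func parseEntries(table []byte, hashSize int) ([]chunkEntry, error) {
	entrySize := 8 + hashSize
	if len(table)%entrySize != 0 {
		return nil, fmt.Errorf("table size %d is not a multiple of entry size %d", len(table), entrySize)
	}
	entries := make([]chunkEntry, 0, len(table)/entrySize)
	for off := 0; off < len(table); off += entrySize {
		entries = append(entries, chunkEntry{
			BlockOffset: binary.LittleEndian.Uint64(table[off : off+8]),
			Checksum:    table[off+8 : off+entrySize],
		})
	}
	return entries, nil
}

// chunkFor returns the entry covering an uncompressed offset, using the
// fixed chunk size from the header (random access is a simple division).
func chunkFor(entries []chunkEntry, chunkSize uint32, uncompressedOff uint64) chunkEntry {
	return entries[uncompressedOff/uint64(chunkSize)]
}
```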
just to explicitly check details here:
- the block offset should point to the offset where the frame corresponding to that chunk index starts
- the frame consists of: magic number, header, data blocks, (decompressed data) checksum
- the checksum here is the checksum of only the compressed data blocks in that frame, excluding the magic number, header & decompressed content checksum, but including the block headers
is that accurate?
> the checksum here is the checksum of only the compressed data blocks in that frame, excluding the magic number, header & decompressed content checksum, but including the block headers
I was thinking the checksum would be from the block offset to the next block offset. We would need to carefully consider what the end of the last block would be if there were other skippable frames before the chunk map skippable frame.
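A sketch of that interpretation: chunk i's checksum covers the blob bytes from offset[i] up to offset[i+1], with the last chunk assumed to end where the chunk-map skippable frame begins (exactly the open question above). SHA-256 is assumed here, per the header example.

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// verifyChunk checks chunk i's checksum over [offsets[i], offsets[i+1]),
// treating tableStart as the end boundary of the final chunk.
func verifyChunk(blob []byte, offsets []uint64, i int, tableStart uint64, want []byte) error {
	end := tableStart
	if i+1 < len(offsets) {
		end = offsets[i+1]
	}
	sum := sha256.Sum256(blob[offsets[i]:end])
	if !bytes.Equal(sum[:], want) {
		return fmt.Errorf("chunk %d: checksum mismatch", i)
	}
	return nil
}
```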
I started working out an implementation for this format here. I wanted to make it easy to review and align, so I first ironed out just the skeleton of the control flow and structure. I've documented what I believe all the use-cases are, and set up function signatures that I believe fulfill those use-cases. I'll start working on implementing these two paths:
Would love to hear any and all feedback on the structure @dmcgowan @hsiangkao @ChengyuZhu6
> * **`application/vnd.erofs.layer.v1+zstd`**
>   * Zstd compressed EROFS filesystem.
>   * MUST contain the Chunk Mapping Table in a skippable frame for random access.
Is this indicating that zstd EROFS layers must support random access?
Should this be required? It will almost certainly increase compressed layer size.
No, this should be reworded. This is trying to say the chunk mapping is needed in order to support random access. Random access is not a requirement though.
What's the current status?
> * **Random Access at Runtime (Lazy Loading):** Containers can start immediately without waiting for the entire image to download or unpack. Data is fetched on-demand.
> * **End-to-End Integrity:** Using DM-Verity, the integrity of the filesystem can be verified at the block level by the kernel during runtime, providing stronger security guarantees than file-level checksums.
> * **Efficient Distribution:** Distributing the filesystem image directly avoids local conversion steps (e.g., unpacking tarballs and creating filesystems), reducing startup latency and CPU usage.
There is no lazy pulling implementation yet, so no.
Hi @AkihiroSuda, it's no longer a containerd 2.3 feature; the current approach just wraps EROFS with zstd.
This is a draft proposal; I'm opening it as a PR rather than an issue to make it easier to comment.
Related PRs defining native EROFS types
Also related to #12502