Add documentation for EROFS layer formats #12703
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,140 @@ | ||
| # EROFS Layer Types for OCI Images | ||
|
|
||
| ## 1. Introduction & Motivation | ||
|
|
||
| This document proposes a specification for distributing EROFS (Enhanced Read-Only File System) layers within OCI (Open Container Initiative) images. | ||
|
|
||
| EROFS is a read-only filesystem designed for high performance and storage efficiency. Enabling native EROFS layers in container images offers several key advantages: | ||
|
|
||
| * **Random Access at Runtime (Lazy Loading):** Containers can start immediately without waiting for the entire image to download or unpack. Data is fetched on-demand. | ||
| * **End-to-End Integrity:** Using DM-Verity, the integrity of the filesystem can be verified at the block level by the kernel during runtime, providing stronger security guarantees than file-level checksums. | ||
| * **Efficient Distribution:** Distributing the filesystem image directly avoids local conversion steps (e.g., unpacking tarballs and creating filesystems), reducing startup latency and CPU usage. | ||
|
|
||
| ## 2. Layer Formats | ||
|
|
||
| This specification defines two primary layer formats: **Uncompressed** and **Compressed (Zstd)**. Both formats support optional DM-Verity data for runtime integrity. | ||
|
|
||
| ### 2.1 Uncompressed EROFS Layer | ||
|
|
||
| This layer consists of a standard EROFS filesystem image. It may optionally append DM-Verity hash tree data at the end of the blob. | ||
|
|
||
| **Structure:** | ||
|
|
||
| ```text | ||
| +-----------------------------------------+ | ||
| | | | ||
| | EROFS Filesystem Image | | ||
| | (Standard EROFS superblock & data) | | ||
| | | | ||
| +-----------------------------------------+ | ||
| | | | ||
| | DM-Verity Hash Data (Optional) | | ||
| | (Appended at the end) | | ||
| | | | ||
| +-----------------------------------------+ | ||
| ``` | ||
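As a rough illustration of the layout above, the following sketch splits an uncompressed layer blob into the EROFS image and the appended DM-Verity data. It assumes the split point comes from the `dev.containerd.erofs.dmverity.offset` annotation; the function name and its `verity_offset` parameter are hypothetical and not part of the proposal. The superblock location (1024 bytes) and magic are taken from the EROFS on-disk format.

```python
import struct

EROFS_SUPER_OFFSET = 1024        # the EROFS superblock starts 1 KiB into the image
EROFS_SUPER_MAGIC = 0xE0F5E1E2   # EROFS superblock magic, stored little-endian


def split_uncompressed_layer(blob: bytes, verity_offset=None):
    """Return (erofs_image, verity_data) for an uncompressed layer blob.

    verity_offset is assumed to be the parsed value of the
    dev.containerd.erofs.dmverity.offset annotation (hypothetical wiring);
    None means no DM-Verity data is appended.
    """
    if len(blob) < EROFS_SUPER_OFFSET + 4:
        raise ValueError("blob too small to hold an EROFS superblock")
    (magic,) = struct.unpack_from("<I", blob, EROFS_SUPER_OFFSET)
    if magic != EROFS_SUPER_MAGIC:
        raise ValueError("missing EROFS superblock magic")
    if verity_offset is None:
        return blob, b""
    if not 0 < verity_offset <= len(blob):
        raise ValueError("DM-Verity offset lies outside the blob")
    return blob[:verity_offset], blob[verity_offset:]
```

The magic check only confirms that the blob looks like an EROFS image; it says nothing about integrity, which is what the DM-Verity data and the manifest digest are for.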
|
|
||
| ### 2.2 Compressed EROFS Layer (Zstd) | ||
|
|
||
| This layer uses Zstd compression to reduce transfer size. Unlike standard compressed layers (like `.tar.gz`), this format supports random access by utilizing a **Chunk Mapping Table**. | ||
|
|
||
| The blob consists of: | ||
| 1. **Zstd Compressed Frames:** The raw EROFS filesystem image data compressed in standard [Zstd frames](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frames). | ||
| 2. **Chunk Mapping Table:** A custom table stored inside a [Zstd Skippable Frame](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#skippable-frames). This allows standard Zstd decompressors to ignore it, while aware readers can use it to locate specific uncompressed chunks. | ||
|
It is really helpful to have this written up explicitly, big thanks @dmcgowan. I wrote up a small proof-of-concept around this proposal to understand the options better; I have more verbose notes here, but to be concise:
The most significant difference I see is the zstd seekable spec provides checksums of the uncompressed data, the proposal describes checksums of compressed data. I don't see much difference between the two-- I imagine you can fail faster with checksums of compressed data but that seems very minor. Is there a threat model in decompressing the data to check the checksum? Or, what am I missing here? Using the zstd seek table also requires some minor modifications to the proposal:
Member
@anniecherk For this part, if we really need to check something on transporting, we should stick to
Member
First, actually
Member
Author
I think 4MiB seems a reasonable chunk size. I could see how it is advantageous for the zstd seekable format to have variable length chunks, but if the compressed file is a block device it doesn't matter. Random access could be nice but could also be an unneeded optimization, and the entire table would already need to be read for digesting anyway. The pointer storage is the same (8 byte pointer vs 4+4 byte sizes); the checksum is the bigger difference (4 bytes vs 32). Having the cryptographic hash is nice to preserve the merkle tree based on what is distributed over the wire and allowing hash check before processing. The biggest issue I could see is using frame size information before cryptographic verification leading to memory exhaustion attacks, although that would likely be limited with zstd anyway by having limits on the frame size. The table data either way is relatively small and I think more than reasonable to fit in memory. With 40 bytes per 4MiB chunk, that's only 10K per GiB. The dm-verity data would be much larger but already randomly seekable and memory-efficient. |
||
| 3. **DM-Verity Hash Data (Optional):** Appended at the end. | ||
|
Does it make sense to also wrap the dm-verity data in a skippable zstd frame?
Member
Author
Yes, absolutely. I think it should be uncompressed (no point in compressing hashes), and the dm-verity offset would still point to the beginning of the data rather than to the skippable frame. |
||
|
|
||
| **Structure:** | ||
|
|
||
| ```text | ||
| +-----------------------------------------+ <--- Start of Blob | ||
| | | | ||
| | Zstd Compressed Frames | | ||
| | (Compressed Raw EROFS Image) | | ||
| | | | ||
| +-----------------------------------------+ <--- Offset via Annotation | ||
| | | | ||
| | Skippable Zstd Frame (0x184D2A5E) | | ||
| | +-----------------------------------+ | | ||
| | | Chunk Mapping Table | | | ||
| | +-----------------------------------+ | | ||
| | | | ||
| +-----------------------------------------+ <--- Offset via Annotation | ||
| | | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we move this part prior to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Related to my Q above about whether to wrap dm-verity data in a skippable frame-- I think we either think of this as: If (A) then I think this comment about viewing erofs image + dm-verity data as a whole component makes more sense to me, but if (B) then the erofs-image is the core data, and both the dm-verity data and the chunk-mapping table are out-of-band data. I think the ordering doesn't really matter because it just changes the parsing, but if we're viewing it as (B) where the chunk mapping table & the dm-verity data are both extra attached data, then putting the optional dm-verity hash data last makes it a little easier to reason about the blob. If it goes last, then you don't have to reason about whether the region following the erofs image is dm-verity or the chunk mapping table-- it's always the chunk mapping table. Not a huge deal either way though since we can find both pieces of data via the offset annotations
Member
Author
I agree the ordering does not matter and both should be out of band data. Since the dm-verity data is just raw data, using the annotations and just ignoring it would be the only option as is. We could always prefix it with a magic number when wrapped in a skippable frame, although it would still not be useful data with the other dm-verity metadata. The reason we don't have a header for dm-verity is that it would just be another value that would need to have its hash checked, so we might as well rely on the annotations for offset and dm-verity parameters since the annotations are part of the already-hash checked manifest.

This is mostly a proposal to start trying something out and we can make any and all changes that make sense after further experimentation. |
||
| | DM-Verity Hash Data (Optional) | | ||
| | | | ||
| +-----------------------------------------+ <--- End of Blob | ||
| ``` | ||
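To make the layout concrete, here is a minimal sketch of reading the payload of the skippable frame at a known offset, for example the value of the `dev.containerd.erofs.zstd.chunk_table_offset` annotation. Per the Zstd format, a skippable frame starts with a 4-byte little-endian magic in the range `0x184D2A50` to `0x184D2A5F`, followed by a 4-byte little-endian payload size. The function name is hypothetical.

```python
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50   # skippable magics are 0x184D2A50..0x184D2A5F
SKIPPABLE_MAGIC_MASK = 0xFFFFFFF0


def read_skippable_frame(blob: bytes, offset: int) -> bytes:
    """Return the user data of the Zstd skippable frame starting at offset."""
    if offset < 0 or offset + 8 > len(blob):
        raise ValueError("offset does not leave room for a frame header")
    magic, size = struct.unpack_from("<II", blob, offset)
    if magic & SKIPPABLE_MAGIC_MASK != SKIPPABLE_MAGIC_BASE:
        raise ValueError("not a Zstd skippable frame")
    start = offset + 8
    end = start + size
    if end > len(blob):
        raise ValueError("frame payload runs past the end of the blob")
    return blob[start:end]
```

A reader would pass the returned payload to the Chunk Mapping Table parser, and could check it against the `dev.containerd.erofs.zstd.chunk_digest` annotation first.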
|
|
||
| ## 3. Binary Format Specification | ||
|
|
||
| All multi-byte integers are stored in **Little-Endian** format unless otherwise specified. | ||
|
|
||
| ### 3.1 Chunk Mapping Table | ||
|
|
||
| The Chunk Mapping Table allows mapping uncompressed offsets to compressed ranges within the Zstd stream. It is stored as the payload of a Zstd Skippable Frame. | ||
|
|
||
| **Header:** | ||
|
|
||
| | Field | Size (Bytes) | Description | | ||
| | :--- | :--- | :--- | | ||
| | Magic | 4 | Magic number (`0xCD 0xE4 0xEC 0x67`) | | ||
| | Version | 4 | Format version (currently `1`) | | ||
| | Uncompressed Size | 8 | Total size of the uncompressed data | | ||
| | Chunk Size | 4 | Size of each uncompressed chunk (e.g., 4 MiB) | |
| | Hash Algo | 1 | Algorithm for chunk checksums (0=None, 1=SHA-256) | | ||
| | Hash Size | 1 | Size of the hash in bytes (e.g., 32 for SHA-256) | | ||
|
Member
I don't have major comments, I think
Member
because the hash size itself can be derived from
Member
Author
We can remove it and just have a reserved field. I was thinking the algorithm would be sha-2 and could be sha256 or sha512. We could just have those algorithms specifically defined though as specific configurations.
Member
I mainly consider the forward-compatibility of this, for example, currently we have And could we just consider supporting sha512 in this first version (since it will be used in the foreseeable future)? |
||
| | Reserved | 2 | Reserved for future use (must be 0) | | ||
|
|
||
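The header as currently drafted is 24 bytes. The sketch below parses it with Python's `struct` module and estimates the total table size. It assumes the magic bytes are stored in the order listed, that chunk entries follow the header immediately, and that every entry carries a checksum of `Hash Size` bytes; the type and function names are hypothetical.

```python
import struct
from typing import NamedTuple

# Magic(4) Version(4) UncompressedSize(8) ChunkSize(4) HashAlgo(1) HashSize(1) Reserved(2)
HEADER = struct.Struct("<4sIQIBBH")           # little-endian, 24 bytes, no padding
MAGIC = bytes([0xCD, 0xE4, 0xEC, 0x67])       # assumed byte order: as listed in the table


class ChunkTableHeader(NamedTuple):
    version: int
    uncompressed_size: int
    chunk_size: int
    hash_algo: int
    hash_size: int


def parse_header(payload: bytes) -> ChunkTableHeader:
    if len(payload) < HEADER.size:
        raise ValueError("payload shorter than the 24-byte header")
    magic, version, usize, csize, algo, hsize, reserved = HEADER.unpack_from(payload)
    if magic != MAGIC:
        raise ValueError("bad chunk table magic")
    if version != 1:
        raise ValueError("unsupported chunk table version")
    if csize == 0 or reserved != 0:
        raise ValueError("invalid chunk size or non-zero reserved field")
    return ChunkTableHeader(version, usize, csize, algo, hsize)


def table_size(header: ChunkTableHeader) -> int:
    """Header plus one (8-byte offset + checksum) entry per chunk."""
    n_chunks = -(-header.uncompressed_size // header.chunk_size)  # ceiling division
    return HEADER.size + n_chunks * (8 + header.hash_size)
```

For a 1 GiB image with 4 MiB chunks and SHA-256 checksums this gives 256 entries of 40 bytes each, so the table stays small enough to hold in memory.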
| **Chunk Entry:** | ||
|
|
||
| There is one entry for every chunk. The index of the entry corresponds to the chunk index (Uncompressed Offset / Chunk Size). | ||
|
|
||
| | Field | Size (Bytes) | Description | | ||
| | :--- | :--- | :--- | | ||
| | Block Offset | 8 | Absolute offset in the blob where this compressed chunk begins | | ||
| | Checksum | N | (Optional) Checksum of the *compressed* block data. Size `N` defined in Header. | | ||
|
just to explicitly check details here:
is that accurate?
Member
Author
I was thinking the checksum would be from the block offset to the next block offset. We would need to carefully consider what the end of the last block would be if there were other skippable frames before the chunk map skippable frame. |
||
|
|
||
| *Note: The size of the compressed block is calculated by: `NextEntry.Offset - CurrentEntry.Offset`.* | ||
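A reader could use the entries roughly as follows: pick the entry by dividing the uncompressed offset by the chunk size, take the compressed range from this entry's offset to the next entry's offset, and check the checksum over those compressed bytes before decompressing. This is a sketch under assumptions: the draft does not yet say where the last chunk ends, so the caller passes a hypothetical `end_of_frames` value (for instance the chunk table offset), and SHA-256 is assumed as the checksum algorithm. All function names are illustrative.

```python
import hashlib
import struct


def parse_entries(body: bytes, n_chunks: int, hash_size: int):
    """Parse n_chunks entries of (8-byte LE offset, hash_size-byte checksum)."""
    entry_size = 8 + hash_size
    if len(body) < n_chunks * entry_size:
        raise ValueError("chunk table body is truncated")
    entries = []
    for i in range(n_chunks):
        base = i * entry_size
        (offset,) = struct.unpack_from("<Q", body, base)
        entries.append((offset, body[base + 8:base + entry_size]))
    return entries


def compressed_range(entries, chunk_size: int, uncompressed_offset: int, end_of_frames: int):
    """Map an uncompressed offset to (chunk index, compressed start, compressed end)."""
    index = uncompressed_offset // chunk_size
    start = entries[index][0]
    # Assumption: the last chunk ends where the caller says the compressed frames end.
    end = entries[index + 1][0] if index + 1 < len(entries) else end_of_frames
    return index, start, end


def verify_chunk(blob: bytes, entries, index: int, start: int, end: int) -> bool:
    """Compare the SHA-256 of the compressed bytes with the table entry."""
    expected = entries[index][1]
    if not expected:          # hash size 0: no checksum recorded
        return True
    return hashlib.sha256(blob[start:end]).digest() == expected
```

Verifying the compressed bytes first means the chunk is only handed to the Zstd decoder after its hash has matched.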
|
|
||
| ### 3.2 DM-Verity Data | ||
|
|
||
| If present, the DM-Verity data is a raw dump of the Merkle tree used by the Linux kernel `dm-verity` target. Its location is defined by an OCI annotation. | ||
|
|
||
| ## 4. OCI Integration | ||
|
|
||
| ### 4.1 Media Types | ||
|
|
||
| * **`application/vnd.erofs.layer.v1`** | ||
| * Uncompressed EROFS filesystem. | ||
| * May include inline EROFS compression (handled internally by EROFS), but the layer blob itself is not compressed for distribution. | ||
| * Optional DM-Verity data appended. | ||
|
|
||
| * **`application/vnd.erofs.layer.v1+zstd`** | ||
| * Zstd compressed EROFS filesystem. | ||
| * MUST contain the Chunk Mapping Table in a skippable frame for random access. | ||
|
Is this indicating that zstd EROFS layers must support random access? Should this be required? It will almost certainly increase compressed layer size.
Member
Author
No, this should be reworded. This is trying to say the chunk mapping is needed in order to support random access. Random access is not a requirement though. |
||
| * Optional DM-Verity data appended. | ||
|
|
||
| ### 4.2 Annotations | ||
|
|
||
| Metadata required for random access and verification is passed via OCI Manifest annotations. | ||
|
|
||
| | Annotation Key | Description | | ||
| | :--- | :--- | | ||
| | `dev.containerd.erofs.zstd.chunk_table_offset` | **Required for Zstd:** Byte offset to the start of the Zstd Skippable Frame containing the Chunk Mapping Table. | | ||
| | `dev.containerd.erofs.zstd.chunk_digest` | **Required for Zstd:** Digest of the chunk map formatted as an OCI digest. | | ||
| | `dev.containerd.erofs.dmverity.root_digest` | **Required for Verity:** The root hash of the DM-Verity tree formatted as an OCI digest. | | ||
| | `dev.containerd.erofs.dmverity.offset` | **Required for Random Access with Verity:** Byte offset where the DM-Verity data begins. If not present, it can be recalculated and must match the root digest if provided. | | ||
| | `dev.containerd.erofs.dmverity.block_size` | **Optional:** Block size used for DM-Verity (default: 4096). | | ||
|
Comment on lines +125 to +127
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The data block size and hash block size can differ, so we need to differentiate between these two values. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we also need an annotation for the salt for dm-verity?
Member
Author
super block could be a good idea, then we would just instead include a |
||
|
|
||
| *Future Goal: Transition `dev.containerd.*` annotations to a standardized namespace (e.g. `org.opencontainers.*`) upon wider adoption.* | ||
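As an illustration of how a client might consume these annotations, the sketch below reports which required keys are missing from a layer descriptor, following the table as currently drafted. Treating the presence of any `dmverity.*` key as "Verity is in use" is an assumption of this sketch, and the function names are hypothetical. Annotation values are strings, as in any OCI descriptor.

```python
ZSTD_MEDIA_TYPE = "application/vnd.erofs.layer.v1+zstd"
PREFIX = "dev.containerd.erofs."


def missing_annotations(media_type: str, annotations: dict) -> list:
    """Return the required annotation keys absent from a layer descriptor."""
    required = []
    if media_type == ZSTD_MEDIA_TYPE:
        required += [PREFIX + "zstd.chunk_table_offset", PREFIX + "zstd.chunk_digest"]
    # Assumption: any dmverity.* key signals that DM-Verity is in use.
    if any(key.startswith(PREFIX + "dmverity.") for key in annotations):
        required.append(PREFIX + "dmverity.root_digest")
    return [key for key in required if key not in annotations]


def verity_block_size(annotations: dict) -> int:
    """Block size for DM-Verity, falling back to the documented default."""
    return int(annotations.get(PREFIX + "dmverity.block_size", "4096"))
```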
|
|
||
|
|
||
| ### 4.3 Layer DiffID | ||
|
|
||
| The Layer DiffID is used by the OCI image config to uniquely identify the | ||
| uncompressed content of a layer. It is included in the rootfs section of the | ||
| config. It is important that the DiffID represents a secure hash of the content, | ||
| ensuring that the digest of an image config is an immutable representation of a | ||
| runnable image. For EROFS, either the digest of the uncompressed EROFS filesystem | ||
| image or the root hash of the DM-Verity tree can be used as the DiffID. If | ||
| DM-Verity data is present, its root hash must be used as the DiffID. | ||
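The selection rule can be sketched as below. It assumes SHA-256 for the content digest and that the DM-Verity root hash is already available as an OCI digest string (for example from the `dev.containerd.erofs.dmverity.root_digest` annotation); the function name is hypothetical.

```python
import hashlib


def layer_diff_id(erofs_image: bytes, verity_root_digest=None) -> str:
    """Pick the DiffID for an EROFS layer.

    verity_root_digest is assumed to be an OCI digest string such as
    "sha256:<hex>"; when it is given, it takes precedence over the
    digest of the uncompressed image.
    """
    if verity_root_digest is not None:
        return verity_root_digest
    return "sha256:" + hashlib.sha256(erofs_image).hexdigest()
```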
Any benchmark results?
there is no lazy pulling implementation yet so no.