Seekable-erofs #12764
Force-pushed from 0a5f9bd to 73a9f02.
hi @dmcgowan @hsiangkao @ChengyuZhu6 @aadhar-agarwal, could you please take a look at this PR at your leisure? I was debating whether to break it into smaller PRs/commits and left it as one for now. If that or any other info would make it easier to review, please let me know; happy to update.

I really think breaking up the commit "Add seekable erofs" would be useful. Also, is the seekable format dedicated to EROFS? If so, how about moving it into
Force-pushed from 73a9f02 to 83f5d36.
Force-pushed from 2576204 to 71a7413.
@hsiangkao thank you for taking a look!
Makes sense; I split it into a series of commits, which hopefully makes it easier to follow. Let me know if there's anything else I can do there. I could also try to split it into a few PRs if that feels more manageable.
Great, updated.
Force-pushed from 721d243 to ffb1383.
Force-pushed from ffb1383 to 15318f1.
Force-pushed from 16e292e to 15318f1.
Force-pushed from 15318f1 to 7808656.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds support for a new seekable EROFS (+zstd) layer media type, including conversion (ctr images convert) and decoding/materialization in the EROFS differ and display tooling.
Changes:
- Introduces the `application/vnd.erofs.layer.v1+zstd` media type with helper detection APIs.
- Adds a seekable EROFS encoder/decoder (chunked zstd frames + skippable chunk table + optional dm-verity payload) with tests.
- Hooks seekable EROFS handling into the EROFS differ, image converter, and manifest printer (chunk table + directory preview).
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| plugins/diff/erofs/differ.go | Detects seekable EROFS media type and eagerly decodes to layer.erofs, optionally materializing dm-verity. |
| pkg/display/manifest_printer.go | Enhances verbose output to show seekable EROFS chunk table and (optionally) directory tree. |
| internal/erofsutils/seekable/zstderofs_test.go | Adds tests for zstd framing decode and seekable EROFS round-trips. |
| internal/erofsutils/seekable/zstderofs.go | Implements full-stream and per-frame zstd decoding helpers. |
| internal/erofsutils/seekable/zstd.go | Adds zstd skippable frame helpers and per-frame compressed checksum computation. |
| internal/erofsutils/seekable/io.go | Adds a counting writer to track offsets on non-seekable outputs (content.Writer). |
| internal/erofsutils/seekable/image.go | Implements seekable EROFS blob encoding (frames + chunk table + optional dm-verity). |
| internal/erofsutils/seekable/doc.go | Documents the seekable EROFS blob format at package level. |
| internal/erofsutils/seekable/dmverity_test.go | Adds dm-verity metadata/payload tests (linux-only). |
| internal/erofsutils/seekable/dmverity_other.go | Stubs dm-verity APIs for non-Linux builds. |
| internal/erofsutils/seekable/dmverity_linux.go | Implements dm-verity payload writing and unpack-time materialization (linux-only). |
| internal/erofsutils/seekable/chunktable_test.go | Adds chunk table serialization/digest-validation tests. |
| internal/erofsutils/seekable/chunktable.go | Implements chunk table on-disk format, digest validation, and frame bounds helpers. |
| internal/erofsutils/seekable/annotations.go | Defines annotations for chunk table and dm-verity metadata. |
| internal/erofsutils/seekable/README.md | Adds a detailed format/design writeup for seekable EROFS. |
| core/images/mediatypes.go | Adds MediaTypeErofsLayerZstd and IsSeekableErofsMediaType. |
| core/images/converter/erofs/seekable.go | Adds converter implementation for seekable EROFS, with annotations and labels. |
| core/images/converter/erofs/erofs.go | Adds raw EROFS conversion logic used by the seekable converter path. |
| cmd/ctr/commands/images/convert.go | Adds ctr images convert flags for raw and seekable EROFS conversion. |
```go
}
labelsMap[labels.LabelUncompressed] = digester.Digest().String()

if err := contentWriter.Commit(ctx, 0, "", content.WithLabels(labelsMap)); err != nil && !errdefs.IsAlreadyExists(err) {
```
```go
dst := bytes.NewBuffer(nil)
json.Indent(dst, cb, prefix+"│", " ")
fmt.Fprintf(p.w, "%s┌────────Content────────\n", prefix)
fmt.Fprintf(p.w, "%s│%s\n", prefix, strings.TrimSpace(dst.String()))
```
```go
// Decompress and show EROFS directory tree, unless the image is too large.
if tbl != nil && tbl.Header.UncompressedSizeBytes > maxDecompressBytes {
	fmt.Fprintf(p.w, "%s│ (EROFS image is %.1f MiB, too large to display directory tree)\n",
		prefix, float64(tbl.Header.UncompressedSizeBytes)/(1024*1024))
} else {
	blobSize := blob.Size()
	blobReader := io.NewSectionReader(blob, 0, blobSize)
	var decompBuf bytes.Buffer
	if _, decErr := seekable.DecodeErofsAll(ctx, blobReader, &decompBuf); decErr != nil {
```
```go
// Return descriptor with the diffID (digest of uncompressed EROFS content).
return ocispec.Descriptor{
	MediaType: ocispec.MediaTypeImageLayer,
```
Force-pushed from 1d7da56 to 3f06552.
Add EROFS conversion support to the ctr convert command, with configurable options for tar-index mode and mkfs parameters.

Usage:
  ctr image convert --erofs src:tag dst:tag
  ctr image convert --erofs --erofs-compression='lz4hc,12' src:tag dst:tag

Signed-off-by: ChengyuZhu6 <[email protected]>
Signed-off-by: Derek McGowan <[email protected]>
Add MediaTypeErofsLayerZstd (application/vnd.erofs.layer.v1+zstd) for the seekable EROFS blob format.

Signed-off-by: Annie Cherkaev <[email protected]>
Add the internal/erofsutils/seekable package with foundational building blocks for the seekable EROFS format:

- zstd skippable frame read/write helpers: the seekable format uses zstd skippable frames to embed the chunk table and dm-verity data inline in a valid zstd stream, so standard zstd decoders can skip over this metadata while still processing the data frames.
- writeZstdFrame: compresses data as independent zstd frames and computes a SHA-512 checksum of the compressed output in a single pass (via MultiWriter). Each chunk must be an independent frame so that a future lazy-loading snapshotter can decompress individual chunks without needing prior frames as context.
- countingWriter: tracks byte offsets in the output stream because containerd's content.Writer does not implement io.Seeker, yet the format requires recording absolute offsets for the chunk table entries and dm-verity frame location. Without this, we would need to buffer the entire blob in memory just to know where each section starts.

Signed-off-by: Annie Cherkaev <[email protected]>
Implement the chunk mapping table, the core data structure that enables random access to compressed EROFS data without full decompression.

The chunk table maps each fixed-size uncompressed byte range to the absolute offset of its corresponding independent zstd frame in the compressed blob, along with an optional SHA-512 checksum. This is what makes the format "seekable": a future lazy-loading remote snapshotter can read just the chunk table, then fetch and decompress only the specific chunks needed for a given file access, avoiding the need to download and decompress the entire layer.

The table is stored as a zstd skippable frame so the overall blob remains a valid zstd stream. Since skippable frames lack built-in checksums, the table's integrity is verified separately via an OCI digest annotation (sha512:<hex>) on the layer descriptor.

Also defines OCI annotation constants for chunk_table_offset, chunk_digest, dmverity.offset, and dmverity.root_digest, which carry the metadata needed to locate and verify these embedded structures within the blob.

Signed-off-by: Annie Cherkaev <[email protected]>
Add a WriteMetadata function to the internal/dmverity package that atomically writes dm-verity metadata (root hash and hash offset) alongside layer blobs using atomicfile. This provides crash-safety guarantees that the previous inline os.WriteFile approach lacked. Also adds input validation: WriteMetadata rejects empty root hashes up front, symmetric with ReadMetadata's existing check.

Convert the EROFS differ's formatDmverityLayer to use the new function, eliminating the inline json.MarshalIndent + os.WriteFile block. The seekable EROFS dm-verity code (next commit) will also use WriteMetadata, avoiding duplicate implementations.

Signed-off-by: Annie Cherkaev <[email protected]>
Add dm-verity encoding (at conversion time) and materialization (at unpack time) for seekable EROFS blobs.

Embedding dm-verity data enables end-to-end integrity verification: the Merkle tree is computed over the uncompressed EROFS image at conversion time and stored as a zstd skippable frame at the end of the blob. At unpack time, the dm-verity payload is extracted and appended to layer.erofs in "single device" mode, so the Linux dm-verity subsystem can verify every block read at mount time without requiring a separate hash device.

The dm-verity superblock stores all parameters (block sizes, salt, algorithm) so they don't need to be duplicated in annotations; only the root digest and frame offset are annotated on the OCI descriptor. A small JSON sidecar (layer.erofs.dmverity) is written alongside layer.erofs using dmverity.WriteMetadata, sharing the same type and atomic-write implementation as the non-seekable EROFS dm-verity path.

The OCI annotation stores the root digest in "algo:hex" format (e.g. "sha512:abcdef...") for self-documenting metadata, while MaterializeDMVerity strips the prefix before writing the sidecar, since the mount handler's ParseRootHash expects plain hex.

Non-Linux platforms get stub implementations that return ErrNotImplemented, since dm-verity is a Linux kernel feature. The conversion pipeline itself only runs on Linux.

Signed-off-by: Annie Cherkaev <[email protected]>
Implement the end-to-end encoding and decoding logic that ties together the zstd primitives, chunk table, and dm-verity components.

EncodeErofsImageTo orchestrates the full conversion from a raw EROFS image to the seekable blob format: it reads the image in fixed-size chunks, compresses each as an independent zstd frame, builds the chunk table with frame offsets and SHA-512 checksums, and optionally appends dm-verity data. The encoder streams output through a countingWriter (since containerd's content store is not seekable) and uses a temp file for dm-verity computation when needed, because the go-dmverity library operates on files rather than streams.

Two decoders support different access patterns:

- DecodeErofsAll: stream-decodes the entire blob for the eager unpack path used by today's EROFS snapshotter, which needs the full raw image.
- DecodeErofsFrame: decodes a single zstd frame by offset for the future lazy-loading snapshotter, verifying the frame's SHA-512 checksum in a single pass via TeeReader to avoid reading the compressed data twice.

Signed-off-by: Annie Cherkaev <[email protected]>
Add SeekableLayerConvertFunc and `ctr images convert --erofs-seekable` to make the seekable EROFS format usable from the command line.

SeekableLayerConvertFunc handles the three-stage conversion pipeline:

1. If the layer is already seekable EROFS (+zstd): no-op
2. If the layer is raw EROFS: wrap in zstd frames with chunk table
3. If the layer is tar-based: convert to raw EROFS first, then wrap

This reuses the existing raw EROFS converter (LayerConvertFunc) for step 3, avoiding duplication of the tar-to-erofs logic. The resulting OCI descriptor carries annotations for chunk table offset/digest and dm-verity offset/root-digest, which downstream consumers (the differ, a future lazy-loading snapshotter) use to locate embedded metadata within the blob.

New CLI flags:

- `--erofs-seekable`: opt into the seekable format
- `--erofs-chunk-size`: set random access granularity (default 4 MiB)
- `--erofs-dm-verity`: enable integrity verification
- `--erofs-dm-verity-block-size`: dm-verity block size (default 4 KiB)

Also extracts buildErofsOpts helper to share raw EROFS option parsing between the --erofs and --erofs-seekable code paths.

Signed-off-by: Annie Cherkaev <[email protected]>
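The three-stage dispatch can be expressed as a pure function of the layer media type. This is a hypothetical sketch, not the PR's SeekableLayerConvertFunc: the raw EROFS media type string and the OCI/Docker tar prefixes below are assumptions about what the converter matches.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	mediaTypeErofs         = "application/vnd.erofs.layer.v1"      // assumed raw EROFS type
	mediaTypeSeekableErofs = "application/vnd.erofs.layer.v1+zstd" // added by this PR
	ociTarLayerPrefix      = "application/vnd.oci.image.layer.v1.tar"
	dockerTarLayerPrefix   = "application/vnd.docker.image.rootfs.diff.tar"
)

type conversionPlan int

const (
	planNoop        conversionPlan = iota // already seekable EROFS
	planWrap                              // raw EROFS: add zstd frames + chunk table
	planTarThenWrap                       // tar: build raw EROFS first, then wrap
)

// planSeekableConversion picks the pipeline stage(s) for one layer.
func planSeekableConversion(mediaType string) (conversionPlan, error) {
	switch {
	case mediaType == mediaTypeSeekableErofs:
		return planNoop, nil
	case mediaType == mediaTypeErofs:
		return planWrap, nil
	case strings.HasPrefix(mediaType, ociTarLayerPrefix),
		strings.HasPrefix(mediaType, dockerTarLayerPrefix):
		return planTarThenWrap, nil
	default:
		return 0, fmt.Errorf("unsupported layer media type %q", mediaType)
	}
}
```

Keeping the decision separate from the I/O makes the no-op case cheap: an already-seekable layer is never re-read.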
Extend the EROFS differ to recognize and unpack seekable EROFS layers,
completing the "convert -> push -> pull -> unpack -> mount" lifecycle
for the new format.
The differ is the integration point between the content store and
the snapshotter: when a seekable EROFS layer is pulled, the differ
must decode it back into a mountable layer.erofs. This commit adds that
decode path:
1. Detect +zstd media type and route to the seekable decode path
2. Stream-decode the entire zstd blob to recover the raw EROFS image
(using DecodeErofsAll—the eager path ignores the chunk table since
it needs the full image anyway)
3. If dm-verity annotations are present, extract the embedded payload
from the blob and append it to layer.erofs with a JSON sidecar,
enabling mount-time integrity verification
The eager unpack path is intentionally simple—it doesn't use the chunk
table for random access. That optimization is deferred to a future
lazy-loading remote snapshotter, which will use DecodeErofsFrame and the
chunk table to fetch only the needed chunks on demand.
Also refactors IsErofsMediaType to correctly handle the new +zstd suffix
using strings.Cut, fixing a latent issue where any "+suffix" EROFS media
type would have been incorrectly rejected.
Signed-off-by: Annie Cherkaev <[email protected]>
Signed-off-by: Annie Cherkaev <[email protected]>
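A minimal sketch of the strings.Cut approach described in the differ commit: compare only the part of the media type before "+", so any suffix (including +zstd) is accepted, while the seekable check remains an exact match. The lowercase function names and the base media type string are assumptions for illustration.

```go
package main

import "strings"

const (
	erofsLayerMediaType         = "application/vnd.erofs.layer.v1" // assumed base type
	seekableErofsLayerMediaType = erofsLayerMediaType + "+zstd"    // added by this PR
)

// isErofsMediaType accepts the base EROFS type with or without a "+suffix".
func isErofsMediaType(mt string) bool {
	base, _, _ := strings.Cut(mt, "+")
	return base == erofsLayerMediaType
}

// isSeekableErofsMediaType matches only the +zstd seekable variant.
func isSeekableErofsMediaType(mt string) bool {
	return mt == seekableErofsLayerMediaType
}
```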
Extend `ctr images inspect` to display seekable EROFS layer details:

- chunk table metadata (uncompressed size, chunk size, hash algorithm, entry count with offset/size/checksum)
- OCI annotations (chunk table offset/digest, dm-verity offset/root digest)
- the EROFS directory tree (decompressed on the fly, with a 256 MiB safety cap)

Also adds a maxDepth parameter to PrintDirectory to bound recursion for deeply nested filesystems.

Signed-off-by: Annie Cherkaev <[email protected]>
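The two safeguards in this commit, the 256 MiB decompression cap and the maxDepth bound on directory printing, can be sketched as follows. dirNode, sizeNote, and renderTree are hypothetical stand-ins for the real EROFS reader and PrintDirectory; the message text mirrors the format string in the manifest printer excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// maxDecompressBytes mirrors the 256 MiB safety cap.
const maxDecompressBytes = 256 << 20

// dirNode is a hypothetical in-memory directory entry.
type dirNode struct {
	Name     string
	Children []*dirNode
}

// sizeNote returns the message shown instead of a tree when the image is
// over the cap, or "" when the tree should be displayed.
func sizeNote(uncompressed uint64) string {
	if uncompressed <= maxDecompressBytes {
		return ""
	}
	return fmt.Sprintf("(EROFS image is %.1f MiB, too large to display directory tree)",
		float64(uncompressed)/(1024*1024))
}

// renderTree prints names depth-first, stopping maxDepth levels below the
// root and marking directories whose contents were cut off with "...".
func renderTree(root *dirNode, maxDepth int) string {
	var b strings.Builder
	var walk func(n *dirNode, depth int)
	walk = func(n *dirNode, depth int) {
		fmt.Fprintf(&b, "%s%s\n", strings.Repeat("  ", depth), n.Name)
		if len(n.Children) == 0 {
			return
		}
		if depth >= maxDepth {
			fmt.Fprintf(&b, "%s...\n", strings.Repeat("  ", depth+1))
			return
		}
		for _, c := range n.Children {
			walk(c, depth+1)
		}
	}
	walk(root, 0)
	return b.String()
}
```

Bounding depth keeps output (and recursion) finite for pathological layers, while the size check avoids decompressing very large images just to print a preview.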
Force-pushed from 3f06552 to 1dd6f0e.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds support for the proposed “seekable EROFS” layer format (application/vnd.erofs.layer.v1+zstd) including encoding/decoding, ctr images convert integration, eager unpack/differ support, and optional dm-verity payload extraction/metadata handling.
Changes:
- Introduces the `+zstd` seekable EROFS media type and detection helpers; adds converter support to emit this format with chunk-table + optional dm-verity annotations.
- Extends the EROFS differ/apply path to eagerly decode seekable EROFS layers and materialize dm-verity sidecar metadata when present.
- Enhances manifest tree printing to inspect seekable EROFS layers (chunk table preview + optional directory tree) and adds tests for chunk table, zstd framing, and dm-verity helpers.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| plugins/diff/erofs/dmverity_linux.go | Switches dm-verity metadata writing to shared dmverity.WriteMetadata. |
| plugins/diff/erofs/differ.go | Adds seekable EROFS apply path (decode + diffID computation + dm-verity materialization). |
| pkg/display/manifest_printer.go | Adds verbose inspection for seekable EROFS layers (annotations, chunk table, directory preview). |
| internal/erofsutils/seekable/zstderofs_test.go | Adds tests for decode-all / per-frame decoding and round-trips. |
| internal/erofsutils/seekable/zstderofs.go | Implements stream decode and single-frame decode with optional checksum verification. |
| internal/erofsutils/seekable/zstd.go | Adds skippable frame helpers and zstd frame writing with SHA-512 checksum. |
| internal/erofsutils/seekable/io.go | Adds counting writer for offset tracking during streaming writes. |
| internal/erofsutils/seekable/image.go | Implements seekable EROFS encoder (chunked frames + chunk table + optional dm-verity). |
| internal/erofsutils/seekable/doc.go | Package documentation for the seekable EROFS format and goals. |
| internal/erofsutils/seekable/dmverity_test.go | Adds dm-verity payload/sidecar tests for seekable EROFS. |
| internal/erofsutils/seekable/dmverity_other.go | Stubs dm-verity functionality on non-Linux platforms. |
| internal/erofsutils/seekable/dmverity_linux.go | Implements dm-verity skippable frame writing and materialization into layer + sidecar. |
| internal/erofsutils/seekable/chunktable_test.go | Adds chunk-table serialization/digest validation and bounds tests. |
| internal/erofsutils/seekable/chunktable.go | Implements chunk table on-disk format, read/verify, and write helpers. |
| internal/erofsutils/seekable/annotations.go | Defines OCI annotations for chunk table and dm-verity metadata. |
| internal/erofsutils/seekable/README.md | Documents format layout, use-cases, and detailed structures. |
| internal/dmverity/dmverity_test.go | Adds unit test coverage for atomic metadata writing. |
| internal/dmverity/dmverity.go | Adds atomic JSON writer for dm-verity metadata sidecar. |
| go.mod | Promotes golang.org/x/term to a direct dependency. |
| core/images/mediatypes.go | Adds seekable EROFS (+zstd) media type constant and predicate. |
| core/images/converter/erofs/seekable.go | Adds conversion path to generate seekable EROFS blobs (+ annotations). |
| core/images/converter/erofs/erofs.go | Adds raw EROFS conversion helper used by seekable conversion. |
| cmd/ctr/commands/images/convert.go | Adds ctr images convert flags for raw/seekable EROFS and dm-verity options. |
```go
for i := range manifest.Layers {
	if len(manifest.Layers) == i+1 {
		subprefix = childprefix + p.format.LastDrop
		subchild = childprefix + p.format.Spacer
	}
	fmt.Fprintf(p.w, "%s%s @%s (%d bytes)\n", subprefix, manifest.Layers[i].MediaType, manifest.Layers[i].Digest, manifest.Layers[i].Size)

	if err := p.showContent(ctx, store, manifest.Layers[i], subchild); err != nil {
		return err
	}
```
```go
// Read entire payload (chunk table header + entries) to verify the digest
payloadBytes := make([]byte, skippablePayloadSize)
if _, err := blob.ReadAt(payloadBytes, chunkTableOffset+zstdSkippableHeaderSizeBytes); err != nil {
	return nil, fmt.Errorf("failed to read chunk table payload: %w", err)
}
```
```go
func readChunkEntries(r io.Reader, entryCount int, hashSize uint32) ([]ChunkTableEntry, error) {
	entries := make([]ChunkTableEntry, 0, entryCount)
	var entryBuf [8]byte

	for i := range entryCount {
		// Read the entry's 8-byte little-endian block offset.
		if _, err := io.ReadFull(r, entryBuf[:]); err != nil {
			return nil, fmt.Errorf("failed to read chunk entry %d block offset: %w", i, err)
		}
		blockOffsetU64 := binary.LittleEndian.Uint64(entryBuf[:])
		if blockOffsetU64 > math.MaxInt64 {
			return nil, fmt.Errorf("chunk entry %d block offset %d overflows int64: %w", i, blockOffsetU64, errdefs.ErrOutOfRange)
		}
		blockOffset := int64(blockOffsetU64)

		// Read the checksum, if present. It is only stored here; validation
		// happens when the corresponding frame is decoded.
		var checksum []byte
		if hashSize != 0 {
			checksum = make([]byte, hashSize)
			if _, err := io.ReadFull(r, checksum); err != nil {
				return nil, fmt.Errorf("failed to read chunk entry %d checksum: %w", i, err)
			}
		}
		entries = append(entries, ChunkTableEntry{
			BlockOffset: blockOffset,
			Checksum:    checksum,
		})
	}
	return entries, nil
}
```
```go
enc, err := zstd.NewWriter(nil, zstd.WithEncoderConcurrency(1))
if err != nil {
	return 0, nil, fmt.Errorf("failed to create zstd encoder: %w", err)
}
```
```go
if err := os.WriteFile(layerBlob, []byte("fake"), 0644); err != nil {
	t.Fatal(err)
}

original := &DmverityMetadata{
	RootHash:   "abc123def456789012345678901234567890123456789012345678901234",
	HashOffset: 8192,
}
err := WriteMetadata(layerBlob, original)
```
```go
		p.printChunkEntry(tbl, i, chunkTableOffset, hashName, prefix)
	}
} else {
	for i := range maxChunkPreview {
```
@anniecherk @dmcgowan I think we are in too much of a rush to dump
PR needs rebase.
Implements:
- `application/vnd.erofs.layer.v1+zstd`
- `ctr images convert` to convert layers to this format

Does NOT implement:
Currently rebased on top of #12772 and #12555.
Seeing it in action:
Pull the alpine image:
Convert to the seekable-erofs format w/ dm-verity support:
Check the annotations on the layer:
Run the image: