
WIP RFC: erofs-snapshotter: support .erofs+{zstd|gzip} images #12506

Closed

anniecherk wants to merge 1 commit into containerd:main from anniecherk:ac/erofs-plus-zstd

Conversation

@anniecherk

Goal:

The current implementation does not support erofs images that are compressed into .erofs+{zstd|gzip} images. Erofs itself supports compression natively, and so the current implementation has three options:

- use erofs images without native compression (uncompressed during image pulls, uncompressed on disk), or
- use erofs images with native compression (compressed during image pulls, compressed on disk), or
- use overlayfs+zstd images and convert to erofs images on apply (compressed during image pulls, converted at unpack time, uncompressed on disk)

The option that would give us the fastest startup + best runtime performance is:
- using erofs images (so that we don't wait to convert from overlayfs to erofs on apply)
- that are compressed when we're pulling (for fast pulls), and
- decompressed on disk (for fast reads at runtime)

That corresponds to .erofs layers that are natively uncompressed, but compressed / decompressed by the diff processor.

-----
Initial experimentation:

I created a small cli tool for converting existing overlayfs images to .erofs+zstd
images. I converted some images to .erofs+zstd, then uploaded them to a registry, and then timed the pull + unpack. I saw 1.5-2x faster pulls over the "overlayfs + convert to erofs at unpack" path.

-----
Looking for feedback:

I'm looking for some quick feedback-- does this approach make sense?

In particular, I'm curious for more context on this comment:
```
// Since `images.DiffCompression` doesn't support arbitrary media types,
// disallow non-empty suffixes for now.
```
My code changes images.DiffCompression to support this -- does that change make sense, or is there a reason why it shouldn't support arbitrary media types?

The code here is a draft; if this approach is sound I can polish the code, run some more principled experiments & add a test.
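
For concreteness, the kind of suffix handling I have in mind looks roughly like the sketch below (hypothetical helper names, not the actual change in this PR, and not the real signature of `images.DiffCompression`):

```
package main

import (
	"fmt"
	"strings"
)

// splitSuffix is a hypothetical helper: it splits a layer media type such as
// "application/vnd.erofs.layer.overlayfs.v1.erofs+zstd" into a base type and a
// compression suffix. containerd's real images.DiffCompression works differently;
// this only illustrates the suffix handling the quoted comment is about.
func splitSuffix(mediaType string) (base, compression string) {
	if i := strings.LastIndex(mediaType, "+"); i >= 0 {
		return mediaType[:i], mediaType[i+1:]
	}
	return mediaType, ""
}

func main() {
	for _, mt := range []string{
		"application/vnd.erofs.layer.overlayfs.v1.erofs",      // example media type, no wrapper
		"application/vnd.erofs.layer.overlayfs.v1.erofs+zstd", // zstd-wrapped
		"application/vnd.erofs.layer.overlayfs.v1.erofs+gzip", // gzip-wrapped
	} {
		base, comp := splitSuffix(mt)
		fmt.Printf("base=%s compression=%q\n", base, comp)
	}
}
```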
@hsiangkao
Member

hsiangkao commented Nov 11, 2025

Sorry, please ignore my previous comment.

In principle, this can be supported. However, compared to native EROFS compressed images, this approach may lack erofs's native random filesystem access capability (especially since it's already a converted, non-OCI format).
Instead, if we introduce a chunked zstd stream (so that the zstd wrapper is just applied on the wire), we could potentially enable random access for .erofs+zstd as well.

> The option that would give us the fastest startup + best runtime performance is:
>
> - using erofs images (so that we don't wait to convert from overlayfs to erofs on apply)
> - that are compressed when we're pulling (for fast pulls), and
> - decompressed on disk (for fast reads at runtime)

Yes, if we consider the best runtime performance, uncompressed erofs images are preferred since there is no decompression overhead at runtime, and the network transmission can be optimized with zstd or other wrapper formats.

Anyway, I think there are a bunch of available options, and this one can be supported since it can at least be parsed seamlessly without noticeable additional code complexity, I think. Also cc @dmcgowan.
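
(To illustrate the chunked zstd idea above -- a rough sketch only, not a concrete format proposal, and assuming github.com/klauspost/compress/zstd: split the uncompressed erofs blob into fixed-size chunks and compress each chunk as an independent zstd frame, so a reader holding the chunk index can fetch and decompress only the ranges it needs.)

```
package erofschunk // illustrative package, not part of this PR

import (
	"github.com/klauspost/compress/zstd" // assumed zstd library
)

// chunk is one independently compressed piece of an uncompressed erofs blob.
type chunk struct {
	uncompressedOff int64  // offset of this chunk within the original erofs image
	compressed      []byte // self-contained zstd frame for this chunk
}

// chunkCompress splits blob into chunkSize pieces and compresses each piece as a
// separate zstd frame. Because every frame is self-contained, any chunk can be
// fetched and decompressed on its own (random access), unlike a single zstd
// stream over the whole layer.
func chunkCompress(blob []byte, chunkSize int) ([]chunk, error) {
	enc, err := zstd.NewWriter(nil) // nil writer: encoder is only used via EncodeAll
	if err != nil {
		return nil, err
	}
	defer enc.Close()

	var chunks []chunk
	for off := 0; off < len(blob); off += chunkSize {
		end := off + chunkSize
		if end > len(blob) {
			end = len(blob)
		}
		chunks = append(chunks, chunk{
			uncompressedOff: int64(off),
			compressed:      enc.EncodeAll(blob[off:end], nil),
		})
	}
	return chunks, nil
}
```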

@anniecherk
Author

Hi @hsiangkao @dmcgowan

I ran a small experiment timing the pull & unpack of the latest pytorch image with this code, and wrote about the setup + results here. Quick summary is I'm seeing half the pull+unpack time relative to the overlayfs snapshotter & the erofs snapshotter doing the conversion at unpack time. In that writeup I describe the small CLI tool that I built to produce the erofs+zstd image, and it's on my todo list to clean that up and open that as a separate PR to complement this one.

Would y'all be willing to review this code & let me know (1) whether we're aligned on supporting / allowing erofs+zstd, and if so, (2) thoughts on the current implementation?

I had gotten pulled away after opening this PR about a month ago but now have lots of bandwidth for the next few weeks. I'd be excited to have this functionality in containerd & am more than happy to iterate on any feedback y'all have.

@anniecherk
Author

@hsiangkao re: chunked zstd stream, that sounds like a great idea to me. Does it sound good to put in a basic implementation without chunking first and then iterate on that as a later pass? I'd be interested to work on that but would ideally like to decouple from this PR to make incremental progress.

@hsiangkao
Member

> Hi @hsiangkao @dmcgowan
>
> I ran a small experiment timing the pull & unpack of the latest pytorch image with this code, and wrote about the setup + results here. Quick summary is I'm seeing half the pull+unpack time relative to the overlayfs snapshotter & the erofs snapshotter doing the conversion at unpack time. In that writeup I describe the small CLI tool that I built to produce the erofs+zstd image, and it's on my todo list to clean that up and open that as a separate PR to complement this one.
>
> Would y'all be willing to review this code & let me know (1) whether we're aligned on supporting / allowing erofs+zstd, and if so, (2) thoughts on the current implementation?
>
> I had gotten pulled away after opening this PR about a month ago but now have lots of bandwidth for the next few weeks. I'd be excited to have this functionality in containerd & am more than happy to iterate on any feedback y'all have.

Personally I'm totally fine with supporting this feature since it doesn't introduce extra logic and it benefits AI use cases.

> re: chunked zstd stream, that sounds like a great idea to me. Does it sound good to put in a basic implementation without chunking first and then iterate on that as a later pass? I'd be interested to work on that but would ideally like to decouple from this PR to make incremental progress.

Fine with me.

@dmcgowan
Member

> @hsiangkao re: chunked zstd stream, that sounds like a great idea to me. Does it sound good to put in a basic implementation without chunking first and then iterate on that as a later pass? I'd be interested to work on that but would ideally like to decouple from this PR to make incremental progress.

I think we can be way more restrictive here and only support compressed blobs in a way we know we can handle it efficiently and with random access. If the goal is to just support transport compression, that should be done at the transport layer. In hindsight, referencing compressed tars was a mistake and we should be careful not to just copy that for consistency.

If we need compression at rest and native compression is not suitable, I would be +1 for only supporting zstd chunked. We can always add more compression support later if someone comes up with a compelling case for it, but removing support for those compressions will be difficult and it may complicate our ability to support lazy pulling.

@hsiangkao
Member

hsiangkao commented Dec 17, 2025

> @hsiangkao re: chunked zstd stream, that sounds like a great idea to me. Does it sound good to put in a basic implementation without chunking first and then iterate on that as a later pass? I'd be interested to work on that but would ideally like to decouple from this PR to make incremental progress.

> I think we can be way more restrictive here and only support compressed blobs in a way we know we can handle it efficiently and with random access. If the goal is to just support transport compression, that should be done at the transport layer. In hindsight, referencing compressed tars was a mistake and we should be careful not to just copy that for consistency.
>
> If we need compression at rest and native compression is not suitable, I would be +1 for only supporting zstd chunked. We can always add more compression support later if someone comes up with a compelling case for it, but removing support for those compressions will be difficult and it may complicate our ability to support lazy pulling.

Hi Derek @dmcgowan, I think people who care more about runtime performance might be concerned about native EROFS compression (for example, lz4 can outperform uncompressed images in many setups, but it produces larger images on the wire since lz4 compresses less; and zstd or lzma EROFS native compression has noticeable runtime overhead anyway, so they might not be useful for high-performance cloud environments, for example).

I wonder if it's possible to support +zstd for now to achieve transport compression only; and if people would like to lazily pull this, then, just as we expect, use zstd chunked to split the uncompressed images into 2MiB chunks for example, and dm-verity can still apply to the original (uncompressed) image.

For native erofs images, it's just application/vnd.erofs.layer.overlayfs.v1.erofs for example, and containerd doesn't need to know any internal implementation details since it just uses erofs raw blobs -- containerd doesn't need to know anything for this setup, and we could wrap them up in go-erofs for full Go support.

For zstd wrappers, it seems it should be application/vnd.erofs.layer.overlayfs.v1.erofs+zstd, which containerd decompresses first, like the current zstd stream processor -- as a first step, the detailed chunked format doesn't need to be discussed here.
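
(As a rough sketch of that "decompress first" path -- assuming github.com/klauspost/compress/zstd; containerd's actual stream processor and diff applier interfaces differ:)

```
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd" // assumed zstd library
)

// applyZstdWrappedLayer streams a .erofs+zstd blob and writes the uncompressed
// erofs image to dst, so the on-disk layer stays uncompressed for fast reads.
// Sketch only: containerd's real diff processors have their own interfaces.
func applyZstdWrappedLayer(compressed io.Reader, dst io.Writer) error {
	dec, err := zstd.NewReader(compressed)
	if err != nil {
		return err
	}
	defer dec.Close()
	_, err = io.Copy(dst, dec)
	return err
}

func main() {
	// Hypothetical file names for illustration.
	in, err := os.Open("layer.erofs.zst")
	if err != nil {
		panic(err)
	}
	defer in.Close()
	out, err := os.Create("layer.erofs")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if err := applyZstdWrappedLayer(in, out); err != nil {
		panic(err)
	}
}
```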

@hsiangkao
Member

hsiangkao commented Dec 17, 2025

@anniecherk, Derek @dmcgowan just mentioned another possibility: Is it possible to just enable http zstd compression for EROFS-formatted blobs in the container registry?
That way, the blob digest would still be the original erofs sha256 rather than a randomly wrapped one, zstd-compressed blobs could also be cached on the registry side, and it would also save transport bandwidth.

Does that sound like a better alternative?
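
(For illustration only: roughly what that transport-level negotiation could look like from a client, assuming a registry or CDN that honors Accept-Encoding for blob GETs and assuming github.com/klauspost/compress/zstd; the URL is a placeholder, not a real endpoint.)

```
package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/klauspost/compress/zstd" // assumed zstd library
)

// fetchBlob requests a blob and opts in to zstd compression on the wire.
// The stored blob (and its digest) stays the original uncompressed erofs image;
// compression only happens in transport when the server chooses to apply it.
func fetchBlob(blobURL string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// Go's transport will not decode zstd for us, so we handle it explicitly.
	req.Header.Set("Accept-Encoding", "zstd")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body io.Reader = resp.Body
	if resp.Header.Get("Content-Encoding") == "zstd" {
		dec, err := zstd.NewReader(resp.Body)
		if err != nil {
			return nil, err
		}
		defer dec.Close()
		body = dec
	}
	return io.ReadAll(body)
}

func main() {
	blob, err := fetchBlob("https://registry.example/v2/library/app/blobs/sha256:0000") // placeholder URL
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("fetched", len(blob), "bytes of uncompressed erofs")
}
```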

@anniecherk
Author

> Is it possible to just enable http zstd compression for EROFS-formatted blobs in the container registry?

That's a clean solution, but unfortunately it doesn't work with our setup. Our registry redirects to an object store to serve the actual blob bytes, and the backing object store doesn't support the Accept-Encoding header or other dynamic compression requests.

Let me think through what an implementation supporting a chunked zstd stream would look like.

@k8s-ci-robot

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@anniecherk
Author

closing as this is now superseded by #12764
