Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erofs snapshotter and differ #10705

Merged
merged 4 commits into from
Feb 5, 2025
Merged

Conversation

hsiangkao
Copy link
Contributor

@hsiangkao hsiangkao commented Sep 18, 2024

Implements #9361

Sorry about my limited Go skill, this version refines the previous @cardyok version (I revised it in my spare time),
I think it's clean for upstreaming.

  1. Add an initial support for EROFS snapshotter and EROFS differ;
  2. Support file-backed mounts which is released in Linux 6.12 https://www.phoronix.com/news/EROFS-Linux-6.12;

In order to try this:

  1. Install erofs-utils v1.7.x or v1.8.2+ (e.g. erofs-utils 1.8 and 1.8.1 has a regression.);
  2. modprobe erofs with Linux 5.16 or higher (for now);
  3. configure the proper differ:
[plugins]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["erofs","walking"]
  1. nerdctl --snapshotter=erofs run -it wordpress:latest

Some preliminary performance numbers: #10705 (comment)

TODOs:

  1. code improvements and cleanups;
  2. support merged fsmeta layer to avoid too many unnecessary EROFS mounts;
    image
  3. chattr +i and fsverity support to fully protect layer blobs from data corruption;
  4. parallel unpacking after a proper way is found(also see: Parallel Container Layer Unpacking #8881): In principle, erofs can support parallel unpacking in a more reliable way since each container layer each erofs blob instead of each container layer each overlay dir;
  5. other alternative form such as composefs, etc.;
  6. some further lazy pulling enhancements (needs more extra work) discussed in Container Plumbing Days OCI meeting in April.

@hsiangkao hsiangkao force-pushed the erofs-snapshotter branch 10 times, most recently from 7b9dbc7 to ccc50bf Compare September 19, 2024 04:41
@hsiangkao hsiangkao force-pushed the erofs-snapshotter branch 2 times, most recently from f57841a to e9eed08 Compare September 19, 2024 04:47
@hsiangkao
Copy link
Contributor Author

hsiangkao commented Sep 23, 2024

Some preliminary performance numbers comparing erofs and overlayfs snapshotter with the following config, both are without parallel unpacking:

version = 2

[plugins]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["erofs","walking"]

  [plugins."io.containerd.differ.v1.erofs"]
    mkfs_options = ["--sort=none"]

kernel version: Linux iZt4n84m6mwq9w1gcpsbfvZ 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux

# nerdctl image pull --snapshotter=overlayfs --unpack="false" docker.io/library/wordpress:latest
# time nerdctl container --snapshotter=overlayfs run -d docker.io/library/wordpress:latest

real 0m5.654s
user 0m0.090s
sys 0m0.013s

# nerdctl image pull --snapshotter=erofs --unpack="false" docker.io/library/wordpress:latest
# time nerdctl container --snapshotter=erofs run -d docker.io/library/wordpress:latest

real 0m4.838s
user 0m0.083s
sys 0m0.023s

and ctr image pull results

# ctr image pull --snapshotter=erofs --local docker.io/library/wordpress:latest

...
unpacking linux/amd64 sha256:a46a4a00e427168b92df0c792d0ab766bd1d4f111aaaae45fd0173ece590b6f3...
done: 4.604676715s

# ctr image pull --snapshotter=overlayfs --local docker.io/library/wordpress:latest

...
unpacking linux/amd64 sha256:a46a4a00e427168b92df0c792d0ab766bd1d4f111aaaae45fd0173ece590b6f3...
done: 5.443237423s

In this way, fsmeta merge feature is disabled as below and loop devices are still used (not the latest kernel):
image

@hsiangkao hsiangkao force-pushed the erofs-snapshotter branch 2 times, most recently from 626a4ca to ec82d81 Compare September 23, 2024 12:14
Copy link
Member

@ktock ktock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests will be needed.

return "", fmt.Errorf("invalid filesystem type for erofs differ: %w", errdefs.ErrNotImplemented)
}
// If the layer is not prepared by the EROFS snapshotter, fall back to the next differ
if _, err := os.Stat(filepath.Join(layer, ".erofslayer")); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks difficult to maintain... I think applier shouldn't be aware of internal directory structure of a snapshotter.

Copy link
Contributor Author

@hsiangkao hsiangkao Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think applier shouldn't be aware of internal directory structure of a snapshotter.

For generic differs I think, yes. But this differ is only useful together with erofs snapshotter.
All cases are as below:
erofs snapshot layer + erofs differ : convert OCI media types (or erofs native blobs later)
erofs snapshot layer + other differs (walking for example) : just unpack to fs and behaves as overlay;
other snapshot layer (e.g. overlay or remote snapshotter) + erofs differ: move to the next differ;
other snapshot layer + other differ : don't care.

This looks difficult to maintain...

Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety.
Do you have better ideas for this requirement?..
I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...

This looks difficult to maintain...

Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety. Do you have better ideas for this requirement?.. I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...

Derek has an ongoing mount manager which could help us to clean up this with new mount[] structure too, I think after this work lands, EROFS snapshotter could have a more clean way then.
But I guess they could be done in parallel? I do hope if you could agree..

@hsiangkao
Copy link
Contributor Author

Tests will be needed.

I will try to add tests to run erofs snapshotter later.

@hsiangkao hsiangkao force-pushed the erofs-snapshotter branch 3 times, most recently from 03fc459 to 0c2285b Compare September 26, 2024 07:47
@hsiangkao
Copy link
Contributor Author

Looks good, but //go:build linux line should be dropped and _linux.go suffix should be used instead

Hi @AkihiroSuda, sorry for my Go skills again, I've revised a new version and thanks for your suggestion!

Copy link
Member

@djdongjin djdongjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fuweid fuweid self-requested a review December 18, 2024 18:00
@AkihiroSuda
Copy link
Member

@fuweid @dmcgowan LGTY?

@hsiangkao
Copy link
Contributor Author

hsiangkao commented Jan 9, 2025

@fuweid @dmcgowan LGTY?

Hi @AkihiroSuda, thanks for pinging. I could summerize the benefits of erofs snapshotter here again (maybe update & refine):

  1. It doesn't need to mount each layer individually compared to SquashFS, EXT4, etc. In other words, only one mount is required for each container image instead of one mount per layer (otherwise, e.g., Wordpress could have 20+ layers → 20+ mounts). For secure containers, it's also beneficial to pass container images with a single virtio-blk to VM rather than 20+ virtio-blks.
  2. It doesn't need loop devices to mount compared with other kernel filesystems since Linux 6.12 LTS.
  3. Compared to EXT4, XFS and BTRFS, etc., EROFS doesn't need to estimate the total space and reserve in advance to keep container image data.
  4. Compared to SquashFS, EROFS natively supports AUFS-style whiteout/opaque, which is useful for OCI layered images.
  5. Compared to SquashFS, EROFS supports many advanced features like FSDAX to eliminate guest page cache for secure containers.

I do hope the kernel features above could be fully leveraged in a dedicated way for container ecosystem, that is why I worked on this. I do hope it can be upstreamed into containerd since it has been pending for almost two years since it was proposed.

@fuweid
Copy link
Member

fuweid commented Jan 9, 2025

Sorry for late reply. It's on my list. Will take a look in this week.

Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see the first version simplified a bit. If we have the merging implementation working, we should stick with that for the first version and add the overlayed erofs layers later on. We could also avoid mixing erofs layers and non-erofs layers by converting on Commit.

Let me know if that really won't work well or the merging is not the preferred solution. We still have time to make improvements and add features to this before the next release.


_, imagePull = info.Labels[targetSnapshotLabel]
if !imagePull && s.mergeFsMeta {
s.GenerateFsMeta(ctx, snap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do this on commit? Create snapshot should be very fast. Also this could be done outside the transaction then renamed inside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Derek, thanks for the suggestion.
The merged fsmeta generation is triggered by the non-unpacking snapshot preparation ("containerd.io/snapshot.ref").
Actually we don't want to generate merged fsmeta on every commited layer (otherwise it will impact performance, I don't want to generate fsmeta all the time), but only for all the parent layers if some layer is not using for unpacking. And fsmeta is kept into the runable layer, not every commited layers.

}}, nil
}

// generate a metadata-only EROFS fsmeta.erofs if the parent layers are all EROFS blobs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing some parent layers to not be EROFS blobs makes this snapshotter more complicated.

}

func (s *snapshotter) Commit(ctx context.Context, name, key string, opts ...snapshots.Opt) error {
return s.ms.WithTransaction(ctx, true, func(ctx context.Context) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think before we enter this transaction is the best time to generate the fsmeta blob or convert the flat directory to erofs (to avoid having a mix of erofs and non-erofs snapshots).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've explained why generating the fsmeta blob here is ineffiencient above.
I think it's reasonable to convert the flat directory to erofs, let me try this.

ovlOptions []string

// If fsmerge feature is enabled or not
mergeFsmeta bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the performance like for this? In the initial version I think it would be best to have this as the default and possibly not even include the loopback version yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mainly used to avoid too many erofs layered mounts and used for secure container virtio-blk passthrough. In other words, it's mainly be used as an advanced feature for now.
If fsmeta merging generation is stayed into Prepare(), it will almost the same performance number against mount one-by-one;
If fsmeta merging generation is moved into Commit(), it will be performed bad against mount one-by-one.

I could reconfirm the exact performance number, but if you think fsmerge feature is still somewhat unclean, I could just drop this feature for now and only leave a very basic version without fsmerge feature.

@hsiangkao
Copy link
Contributor Author

hsiangkao commented Jan 10, 2025

I would like to see the first version simplified a bit. If we have the merging implementation working, we should stick with that for the first version and add the overlayed erofs layers later on. We could also avoid mixing erofs layers and non-erofs layers by converting on Commit.

Yes, it's quite a clever idea to convert non-erofs layer to erofs layer on Commit (although non-erofs layer conversion will make erofs snapshotter together with the walking differ performs longer than the overlayfs snapshotter, but I could see some benefit is that mixing overlay dir is bad for VM passthrough use cases...)
But generating fsmerge on Commit as well is unacceptable for users, if you think generating fsmeta on Prepare() is unclean, I tend to drop this "fsmerge" feature for now (thus loopback device handling in the current erofs snapshotter codebase is no longer needed either.) I'm pretty fine with that.

Let me know if that really won't work well or the merging is not the preferred solution.

Although "merging is not the preferred solution" but it needs to be generated for Active Snapshot for running containers later, rather than snapshots for unpacking.

We still have time to make improvements and add features to this before the next release.

Yeah, agreed, thanks!

The EROFS differ only applies to EROFS layers which are marked by
a special file `.erofslayer` generated by the EROFS snapshotter.

Why it's needed?  Since we'd like to parse []mount.Mount directly
without actual mounting and convert OCI layers into EROFS blobs,
`.erofslayer` gives a hint that the active snapshotter supports
the output blob generated by the EROFS differ.

I'd suggest it could be read together with the next commit.

Signed-off-by: cardy.tang <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
It allows us to mount each EROFS blob layer (generated by the EROFS
differ) independently, or use the "unpacked" fs/ directories (if
some other differ is used.)

Currently, it's somewhat like the overlay snapshotter, but I tend
to separate the new EROFS logic into a self-contained component,
rather than keeping it tangled in the very beginning.

Existing users who use the overlay snapshotter won't be impacted
at all but they have a chance to use this new snapshotter to
leverage the EROFS filesystem.

Signed-off-by: cardy.tang <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
Some basic tests for now.

Signed-off-by: Gao Xiang <[email protected]>
@rchincha
Copy link

Just a general question, is this snapshotter planned to be in-tree?

@hsiangkao
Copy link
Contributor Author

hsiangkao commented Jan 14, 2025

Just a general question, is this snapshotter planned to be in-tree?

Yes. At least that is the plan on my side, which will provide EROFS kernel features to users much like other in-tree snapshotters.

@dmcgowan dmcgowan added this to the 2.1 milestone Jan 22, 2025
Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested in this from Kata/CoCo perspective. I did an initial check of the PR and test runs with runc.


```
[plugins."io.containerd.service.v1.diff-service"]
default = ["erofs","walking"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing this warning in the logs:

WARN[2025-01-31T14:27:36.659539529+02:00] multiple differs match for platform, set `differ` option to choose, skipping "walking" 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems it relates to transfer service, but I don't really know the details too.
EROFS snapshotter can only work effectively with the related differ, so just
leaving "default = ["erofs"]" is fine in practice.

in the very beginning, it'd be better to be left as an independent snapshotter
so that existing overlayfs users won't be impacted by the new behaviors and
users could have a chance to try and develop the related ecosystems (such as
ComposeFS, confidential containers, gVisor, Kata, gVisor, and more) together.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in the "confidential containers" case. How the guest would know the snapshotter prepared everything correctly (and the image integrity is preserved)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some changes should be made in the kata-shim, currently I don't have enough time to prepare a optimized workable implementation for this, but after this lands, I will try to sort out these.


- Native EROFS layers can be pulled from registries without conversion.

For VM containers, the EROFS snapshotter can efficiently pass through and share
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the VM based use-case, it feels to me this has some overlaps with the blockfile snapshotter too.

Is there currently a way to get this tried out with Kata based setup?

Copy link
Contributor Author

@hsiangkao hsiangkao Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in principle, the blockfile approach is ineffective if underlay fs doesn't support reflink.
Also I'd like to use vhost-user protocol (+ a merged flatten device) to pass through the container image for VM-based containers, rather than generating some loopback device on the host, which is ineffective too.

Is there currently a way to get this tried out with Kata based setup?

No, as I said above. I will not work out any further since I don't have enough time (and it's also useless) to complete all the use cases before upstreaming.
Personally I think it's a very effective approach for VM-based image pass through. but currently I don't really have a promising reason to persuade our development dept. to put more development resource on it, esp. before upstreaming.

Comment on lines +7 to +8
In order to leverage EROFS-formatted blobs, the EROFS differ is needed to be
used together to apply image layers. Otherwise, the EROFS snapshotter will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in mkfs.erofs?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in mkfs.erofs?).

I could try to mention it a bit, but that is an implementation detail, since it's a real necessary part to support OCI images. --aufs option in mkfs.erofs will handle AUFS-like whiteouts and opaques properly.

Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this in, there is time to test and iterate on it before the 2.1 release

@dmcgowan dmcgowan added this pull request to the merge queue Feb 5, 2025
Merged via the queue into containerd:main with commit 59c8cf6 Feb 5, 2025
58 checks passed
@dmcgowan dmcgowan mentioned this pull request Feb 6, 2025
7 tasks
@dmcgowan dmcgowan changed the title [Feat] erofs snapshotter and differ Erofs snapshotter and differ Mar 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

10 participants