-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Erofs snapshotter and differ #10705
Erofs snapshotter and differ #10705
Conversation
7b9dbc7
to
ccc50bf
Compare
f57841a
to
e9eed08
Compare
e9eed08
to
879e392
Compare
626a4ca
to
ec82d81
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests will be needed.
plugins/diff/erofs/differ.go
Outdated
return "", fmt.Errorf("invalid filesystem type for erofs differ: %w", errdefs.ErrNotImplemented) | ||
} | ||
// If the layer is not prepared by the EROFS snapshotter, fall back to the next differ | ||
if _, err := os.Stat(filepath.Join(layer, ".erofslayer")); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks difficult to maintain... I think applier shouldn't be aware of internal directory structure of a snapshotter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think applier shouldn't be aware of internal directory structure of a snapshotter.
For generic differs I think, yes. But this differ is only useful together with erofs snapshotter.
All cases are as below:
erofs snapshot layer + erofs differ : convert OCI media types (or erofs native blobs later)
erofs snapshot layer + other differs (walking for example) : just unpack to fs
and behaves as overlay;
other snapshot layer (e.g. overlay or remote snapshotter) + erofs differ: move to the next differ;
other snapshot layer + other differ : don't care.
This looks difficult to maintain...
Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety.
Do you have better ideas for this requirement?..
I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...
This looks difficult to maintain...
Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety. Do you have better ideas for this requirement?.. I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...
Derek has an ongoing mount manager which could help us to clean up this with new mount[] structure too, I think after this work lands, EROFS snapshotter could have a more clean way then.
But I guess they could be done in parallel? I do hope if you could agree..
I will try to add tests to run erofs snapshotter later. |
03fc459
to
0c2285b
Compare
Hi @AkihiroSuda, sorry for my Go skills again, I've revised a new version and thanks for your suggestion! |
8cf5146
to
1dae4f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hi @AkihiroSuda, thanks for pinging. I could summerize the benefits of
I do hope the kernel features above could be fully leveraged in a dedicated way for container ecosystem, that is why I worked on this. I do hope it can be upstreamed into containerd since it has been pending for almost two years since it was proposed. |
Sorry for late reply. It's on my list. Will take a look in this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see the first version simplified a bit. If we have the merging implementation working, we should stick with that for the first version and add the overlayed erofs layers later on. We could also avoid mixing erofs layers and non-erofs layers by converting on Commit.
Let me know if that really won't work well or the merging is not the preferred solution. We still have time to make improvements and add features to this before the next release.
|
||
_, imagePull = info.Labels[targetSnapshotLabel] | ||
if !imagePull && s.mergeFsMeta { | ||
s.GenerateFsMeta(ctx, snap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do this on commit? Create snapshot should be very fast. Also this could be done outside the transaction then renamed inside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Derek, thanks for the suggestion.
The merged fsmeta generation is triggered by the non-unpacking snapshot preparation ("containerd.io/snapshot.ref").
Actually we don't want to generate merged fsmeta on every commited layer (otherwise it will impact performance, I don't want to generate fsmeta all the time), but only for all the parent layers if some layer is not using for unpacking. And fsmeta is kept into the runable layer, not every commited layers.
}}, nil | ||
} | ||
|
||
// generate a metadata-only EROFS fsmeta.erofs if the parent layers are all EROFS blobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing some parent layers to not be EROFS blobs makes this snapshotter more complicated.
} | ||
|
||
func (s *snapshotter) Commit(ctx context.Context, name, key string, opts ...snapshots.Opt) error { | ||
return s.ms.WithTransaction(ctx, true, func(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think before we enter this transaction is the best time to generate the fsmeta blob or convert the flat directory to erofs (to avoid having a mix of erofs and non-erofs snapshots).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've explained why generating the fsmeta blob here is ineffiencient above.
I think it's reasonable to convert the flat directory to erofs
, let me try this.
ovlOptions []string | ||
|
||
// If fsmerge feature is enabled or not | ||
mergeFsmeta bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the performance like for this? In the initial version I think it would be best to have this as the default and possibly not even include the loopback version yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mainly used to avoid too many erofs layered mounts and used for secure container virtio-blk passthrough. In other words, it's mainly be used as an advanced feature for now.
If fsmeta merging generation is stayed into Prepare(), it will almost the same performance number against mount one-by-one;
If fsmeta merging generation is moved into Commit(), it will be performed bad against mount one-by-one.
I could reconfirm the exact performance number, but if you think fsmerge feature is still somewhat unclean, I could just drop this feature for now and only leave a very basic version without fsmerge feature.
Yes, it's quite a clever idea to convert non-erofs layer to erofs layer on Commit (although non-erofs layer conversion will make erofs snapshotter together with the walking differ performs longer than the overlayfs snapshotter, but I could see some benefit is that mixing overlay dir is bad for VM passthrough use cases...)
Although "merging is not the preferred solution" but it needs to be generated for Active Snapshot for running containers later, rather than snapshots for unpacking.
Yeah, agreed, thanks! |
The EROFS differ only applies to EROFS layers which are marked by a special file `.erofslayer` generated by the EROFS snapshotter. Why it's needed? Since we'd like to parse []mount.Mount directly without actual mounting and convert OCI layers into EROFS blobs, `.erofslayer` gives a hint that the active snapshotter supports the output blob generated by the EROFS differ. I'd suggest it could be read together with the next commit. Signed-off-by: cardy.tang <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
1dae4f0
to
66f5a49
Compare
It allows us to mount each EROFS blob layer (generated by the EROFS differ) independently, or use the "unpacked" fs/ directories (if some other differ is used.) Currently, it's somewhat like the overlay snapshotter, but I tend to separate the new EROFS logic into a self-contained component, rather than keeping it tangled in the very beginning. Existing users who use the overlay snapshotter won't be impacted at all but they have a chance to use this new snapshotter to leverage the EROFS filesystem. Signed-off-by: cardy.tang <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
Some basic tests for now. Signed-off-by: Gao Xiang <[email protected]>
66f5a49
to
2f15d65
Compare
Just a general question, is this snapshotter planned to be in-tree? |
Yes. At least that is the plan on my side, which will provide EROFS kernel features to users much like other in-tree snapshotters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be interested in this from Kata/CoCo perspective. I did an initial check of the PR and test runs with runc.
|
||
``` | ||
[plugins."io.containerd.service.v1.diff-service"] | ||
default = ["erofs","walking"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing this warning in the logs:
WARN[2025-01-31T14:27:36.659539529+02:00] multiple differs match for platform, set `differ` option to choose, skipping "walking"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems it relates to transfer service, but I don't really know the details too.
EROFS snapshotter can only work effectively with the related differ, so just
leaving "default = ["erofs"]" is fine in practice.
in the very beginning, it'd be better to be left as an independent snapshotter | ||
so that existing overlayfs users won't be impacted by the new behaviors and | ||
users could have a chance to try and develop the related ecosystems (such as | ||
ComposeFS, confidential containers, gVisor, Kata, gVisor, and more) together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm interested in the "confidential containers" case. How the guest would know the snapshotter prepared everything correctly (and the image integrity is preserved)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some changes should be made in the kata-shim, currently I don't have enough time to prepare a optimized workable implementation for this, but after this lands, I will try to sort out these.
|
||
- Native EROFS layers can be pulled from registries without conversion. | ||
|
||
For VM containers, the EROFS snapshotter can efficiently pass through and share |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the VM based use-case, it feels to me this has some overlaps with the blockfile
snapshotter too.
Is there currently a way to get this tried out with Kata based setup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in principle, the blockfile approach is ineffective if underlay fs doesn't support reflink.
Also I'd like to use vhost-user protocol (+ a merged flatten device) to pass through the container image for VM-based containers, rather than generating some loopback device on the host, which is ineffective too.
Is there currently a way to get this tried out with Kata based setup?
No, as I said above. I will not work out any further since I don't have enough time (and it's also useless) to complete all the use cases before upstreaming.
Personally I think it's a very effective approach for VM-based image pass through. but currently I don't really have a promising reason to persuade our development dept. to put more development resource on it, esp. before upstreaming.
In order to leverage EROFS-formatted blobs, the EROFS differ is needed to be | ||
used together to apply image layers. Otherwise, the EROFS snapshotter will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in mkfs.erofs
?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in
mkfs.erofs
?).
I could try to mention it a bit, but that is an implementation detail, since it's a real necessary part to support OCI images. --aufs
option in mkfs.erofs
will handle AUFS-like whiteouts and opaques properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in, there is time to test and iterate on it before the 2.1 release
Implements #9361
Sorry about my limited Go skill, this version refines the previous @cardyok version (I revised it in my spare time),
I think it's clean for upstreaming.
In order to try this:
modprobe erofs
with Linux 5.16 or higher (for now);nerdctl --snapshotter=erofs run -it wordpress:latest
Some preliminary performance numbers: #10705 (comment)
TODOs: