Skip to content

Conversation

@ambarve
Copy link
Contributor

@ambarve ambarve commented Jul 12, 2023

Details described in #8346

@k8s-ci-robot
Copy link

Hi @ambarve. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ambarve ambarve force-pushed the cimfs branch 4 times, most recently from 80b7a6e to 65dba00 Compare July 17, 2023 16:50
@ambarve ambarve marked this pull request as ready for review July 17, 2023 16:50
replace github.com/AdaLogics/go-fuzz-headers => github.com/AdamKorcz/go-fuzz-headers-1 v0.0.0-20230111232327-1f10f66a31bf
replace (
github.com/AdaLogics/go-fuzz-headers => github.com/AdamKorcz/go-fuzz-headers-1 v0.0.0-20230111232327-1f10f66a31bf
github.com/Microsoft/hcsshim => github.com/ambarve/hcsshim v0.9.1-0.20230715003907-4c7289998f8a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR still WIP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP is only for the vendoring commit. Once the hcsshim PRs are merged and we have a tag on hcsshim, I will update the go.mod with that tag and removed the WIP tag.

@slonopotamus
Copy link
Contributor

slonopotamus commented Jul 17, 2023

I see that mount/mount_windows.go is still WCIFS-only. Is it intentional? You can end up there by multiple codepaths, for example func (b *Bundle) Delete() in runtime/v2/bundle.go through cleanupAfterDeadShim. How this is supposed to work when CimFS snapshotter is used?

We just discussed the same topic in #8789: containerd has several mount/unmount codepaths that bypass the snapshotter and ended up with "containerd snapshotter API needs to be improved". But in that case it can be delayed (there's only one snapshotter anyway). But I don't see how current logic can work with WCIFS+CimFS.

@ambarve
Copy link
Contributor Author

ambarve commented Jul 18, 2023

@slonopotamus CimFs works evne without making any changes to the existing mount APIs because

  1. we currently don't support layer extraction / image building with CimFs. Containerd primarily mounts the snapshots for those operations and since CimFs doesn't support it, this isn't a problem. (But I agree that attempt to mount CimFs snapshots should throw a proper error)
  2. During container cleanup hcsshim ensures that all the mounted CimFs snapshots are properly cleaned.

After thinking about this more, I am going to update this PR to include the mount/unmount code for CimFs snapshots. That will simplify hcsshim cleanup logic too.

@slonopotamus
Copy link
Contributor

Just in case: I'm not insisting on any changes in current PR, I'm just trying to build a mental model about all this mount stuff in my head)

@ambarve
Copy link
Contributor Author

ambarve commented Jul 31, 2023

@AkihiroSuda & @slonopotamus, I have thought about the Mounts of CimFs based snapshots and I have decided to not include support for Mounting/Unmounting of CimFs based snapshots in this PR.
Currently, the windows snapshotter's Mount/Unmount functionality in containerd is only used during the image creation process i.e. those Un/mount functions aren't used at all during a container's startup and teardown process. Mounting/Unmounting of container image layers is completely handled by hcsshim as of now. Since, we currently don't support export of CimFs layers, building windows container images with CimFs layers isn't an option anyway and hcsshim handles the Mounting/Unmounting of CimFs layers for container startup/teardown. Adding the CimFs snapshot mount/unmount functionality in containerd won't provide any benefit right now.
Moreover, as mentioned earlier, Mounting/Unmounting of windows snapshots done by containerd & hcsshim are completely independent. i.e. if a layer is mounted by hcsshim, containerd has not idea about that mount and cannot clean it up in case the shim process crashes. This means we probably need a generic mount manager type of solution which will allow containerd to keep track of all the mounts (including the mounts made by the shim processes). However, design and implementation of such a mount manager plugin will take more time and there is no real advantage of holding this PR off until then.

Let me know what you think.

P.S: This PR is still marked as WIP because the hcsshim commit that this PR vendors in is still from my private branch. But rest of the code is ready to review.

@ambarve ambarve force-pushed the cimfs branch 2 times, most recently from 0604de4 to c67c646 Compare July 31, 2023 18:17
Copy link
Contributor

@TBBle TBBle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been through this heavily, but my understanding is that this snapshotter is just a thin wrapper around the existing windows snapshotter, to insert imported-from-content-store CIMs as parent layers, but then subsequent locally-created WCIFS layers are used as is the case today, so a container running a locally-built container images will look like CIMFS+CIMFS+WCIFS+WCIFS+Scratch.

Why isn't this part of the wcowSnapshotter, enabled when OS support is present, and maybe with a feature flag? As it stands, this only applies to layers imported from the content store, which wcowSnapshotter is already aware of and handles specially, so being more special should be fine, and won't require users to make config changes.

For its current use-case of imported-from-content-store layers, the only thing you lose is local host mount support for layer-creation use, and I understand it's desired to fix that. The current windowsSnapshotter can't export/diff imported-from-content-store layers either, so that lack isn't a blocker for this to be the default snapshotter on Windows.

Anyway, if I'm parsing this correctly, the snapshotter as-is would be suitable to use in a CRI config under a k8s cluster or similar situation where you only ever import layers from the content store and run a single scratch layer on top of them.

Hmm... Does CimFS (or UnionFS) support not creating a scratch layer at all for a read-only root mount? I'm not sure I've ever tried read-only root on a Windows container.


It seems most of the heavy lifted is done by hcsshim when it sees a cimfs mount type, and there's nothing equivalent to func (s *wcowSnapshotter) convertScratchToReadOnlyLayer for creating new CIM layers.

In the future, it'd be nice if convertScratchToReadOnlyLayer or equivalent could output CIM layers, since that'd let us speed-up build times for large filesystem layers (which currently pay the price of backuptar twice, see #5384), and not require a push-delete-and-pull to take advantage of the speed increases from CIM usage when creating the next layer.

Separately, if we had tar export support for CIM layers, then we could delete the local sandbox.vhdx after convertScratchToReadOnlyLayer, and save like 50% of the storage cost of locally-built Windows container layers. (As noted in #5384, this might already have been solved in the RS5 computestorage API, but I never got any of those working in containerd before I went on sabbatical.)

Edit: I looked again at the CimFS Windows API docs, and there's actually nothing to extract data from a CIM except to mount it as a filesystem? It's a write-only API? Maybe we end up needing Windows-equivalent of the walking differ, and that solves #5384 too.


Lack of mounting support will prevent the snapshot test suite being able to be run. (The Import/Export Layer tests can still run without mounting, but you also don't have Export support here, so it's really just the ImportLayer test that can be run in the current state.)

My current understanding from the discussion is that the lack of mounting support is more about containerd mount management than underlying infeasibility.

We'd need to support CIMFS+CIMFS+WCIFS+Scratch mounting as single volume or mount-point. That support could be exported from hcsshim, and could also support WCIFS-Base+WCIFS+Scratch in a single API (or two, one CIMFS and one non-CIMFS, since we know the CIMFS-ness from the Mount object), hiding the Apply/Prepare dance from containerd, (and maybe even the BindFS call? Maybe UnionFS replaces this call?) i.e. Windows mount code could look like Windows Apply code, and be just a single hand-off to hcsshim rather than a bunch of hcsshim API calls with no intervening logic.

That said, I also haven't been through the hcsshim-side code, so I don't know if this is actually possible, or if CIMFS+CIMFS+WCIFS+Scratch is actually only possible inside a container. (Although that'd block host-process too, right? Or maybe the recent silo changes made this trade-off, in exchange for supporting the virtual filesystem thing whose name escapes me.)

It'd be nice if we also get read-only mount support (CIMFS+CIMFS+WCIFS, no scratch) out of this to save creating a new sandbox VHDX for Snapshotter.View, but that's only tangential here. (Maybe "UnionFS" can solve this? Will it support WCIFS layers too?)

Comment on lines 87 to 88
// Container Layer written in the CIM format. The caller must be holding
// SeBackupPrivilege and SeRestorePrivilege.
Copy link
Contributor

@TBBle TBBle Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #8346, I thought these privileges were not going to be necessary anymore?

I see the same comment at https://github.com/ambarve/hcsshim/blob/26f23d5c59383ccb23e8d7a99c80c9fc29cdf62f/pkg/ociwclayer/cim/import.go#L25-L26, but can't see anything at https://learn.microsoft.com/en-us/windows/win32/api/_cimfs/ that explains this need.

I also note that the differ appears to deliberately not take these privileges, i.e. that's the main difference between windowsDiff.Apply and cimDiff.Apply, so I assume this is a documentation cut-and-paste issue.

I see the CimFS sample code takes those privileges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I forgot to update these comments. I have fixed it here. I will fix it in hcsshim too.

@ambarve
Copy link
Contributor Author

ambarve commented Aug 15, 2023

I like the idea of combining the cimfs snapshotter with the existing windows snapshotter. Originally, we created a separate snapshotter because the cimfs snapshotter used to automatically mount the cim whenever a scratch layer was created with a cim based parent layer (hcsshim didn't mount cims in that case). But now that we don't have this, I am open to the idea of combining these two snapshotters. Let me look into it a bit more.

When you say, "The current windowsSnapshotter can't export/diff imported-from-content-store layers either", what do you mean? A snapshotter isn't responsible for generating the diff, right? For now, the only intended use of this snapshotter is for imported image layers. We currently don't support exorting a cim based layer or a scratch layer (created with UnionFS) to tar but we do plan to add that support in the near future.

Mounting support, especially in a way that handles both legacy (WCIFS) & CimFS based layers is somewhat complicated. For example, current mount support for windows snapshots is only used for locally mounting the snapshots (for image builds etc.). It doesn't handle the snapshot mounts by a shim process. Plus, even if we add the mount support for CimFS based snapshots in containerd right now, that won't be of much use as layer export isn't supported. That's why, we plan to get this PR merged in first and then continue working on adding layer export & mount support.

Btw, I would like to clarify that we won't be using WCIFS with CimFS at all. CimFS will be used with UnionFS.

@ambarve ambarve force-pushed the cimfs branch 2 times, most recently from 72a2142 to 17c626d Compare August 15, 2023 21:31
@TBBle
Copy link
Contributor

TBBle commented Aug 16, 2023

When you say, "The current windowsSnapshotter can't export/diff imported-from-content-store layers either", what do you mean? A snapshotter isn't responsible for generating the diff, right?

What I mean is that even with the existing WCIFS-based Snapshotter, when you import a layer from the content store, you only have the WCIFS read-only layer (files-on-disk). And hcsshim's v1 layer API can only generate a tar stream from diffing a WCIFS read-write layer (VHDX file) against its parent read-only layer, not from a read-only layer against its parent read-only layer.

So yeah, the Snapshotter doesn't create the diff, I just tie them together in my head. Frankly, I feel like a Snapshot plugin should be able to provide tarstream handlers in a neater way than the existing AsWindowsContainerLayer/applyWindowsContainerLayer hack. Surely there are or are going to be other snapshotters that can generate a tarstream diff more efficiently than the naive walking-diff. (btrfs snapshots include diff-output support using btrfs send but it might not be suitable anyway). (Thinking about that, I suspect that hack was wrong-headed, and the relevant code should be moved directly into the windows differ plugin, since that's the only thing that activates those code paths, and I suspect the remaining dependencies on the rest of the archive library is minimal and/or public symbols. Separate refactoring though. I wish I'd thought of that at the start of the summer holidays, not immediately after the finished.)

Anyway, see

// NOTE: We do not delete the sandbox.vhdx here, as that will break later calls to
// ociwclayer.ExportLayerToTar for this snapshot.
// As a consequence, the data for this layer is held twice, once on-disk and once
// in the sandbox.vhdx.
// TODO: This is either a bug or misfeature in hcsshim, so will need to be resolved
// there first.
and related rambling at #5384 (comment) for the Snapshotter differ issue I was referring to here.

It's possible hcsshim's v2 layer API (computestorage) allows exporting from diffing two WCIFS read-only layers (there's a constant that suggests this is possible), but I never got computestorage's hcsshim projection able to do anything before I went on sabbatical, so I never tried it.


Your comment makes sense to me. We definitely need export before mount (I'm somewhat familiar with the ordering of features needed here. ^_^) so I'm mostly interested in what's future-feasible versus what's impossible, in terms of how this feature is positioned/advertised/used.


I was thinking about this, and it seemed like the CimFS C++ API docs implied that the intent was that a single CIM would hold multiple layers, e.g. the deduplication features and forks, but this implementation is using a CIM-per-layer?

I thought the multiple layers in one big CIM file would be nice, because that would deduplicate between multiple patch versions of the Windows OS second layer, which I suspect is 95% identical between adjacent patches; I haven't checked this though, so that's ±95%.

@ambarve
Copy link
Contributor Author

ambarve commented Aug 22, 2023

I will have to investigate a bit more to find out if currently there is a way of generating a diff from two read-only layers. But for CimFs, we do plan on adding that support i.e. we will add a way to read a layer CIM into a tar.

CimFs is designed to deduplicate layer data at the block level. It is also designed to efficiently merge multiple CIMs to generate a merged view of multiple layers. So it makes more sense to create an individual CIM per layer. That way multiple images with common layers, can also reuse the common CIMs. Even if multiple layers of an image contain duplicate files (or files with very minor changes), CimFs should be able to deduplicate those blocks files.

Edit:
My earlier comment about CimFs deduplicating at block level wasn't accurate. Currently, CimFs only dedups at the file level. (However, block based deduplication might be added in the future).

@ambarve ambarve marked this pull request as ready for review August 22, 2023 00:24
@mxpv mxpv added the status/needs-update Awaiting contributor update label Sep 12, 2023
@kevpar
Copy link
Member

kevpar commented Sep 26, 2023

In general, I think the interface between containerd and the hcsshim code (primarily the ociwclayer/cim package) should be less coupled and more explicit in how it functions.

For instance, rather than passing a path to a directory in the CimFS mount, and assuming that for directory <id> there is a cim at cim-layers/<id>.cim, I think it would be better to explicitly give the path to the cim file in the mount.

I hope that this can reduce coupling between the two packages, e.g. so that they no longer need to agree on a common directory structure with cim-layers. This will also make it easier if we want to reorganize the filesystem structure for the CimFS snapshotter, since it can change without needing to update hcsshim code.

The way I view this is the code in hcsshim should be written to be generic as to where it's being used (for instance, it doesn't need to know what a "snapshotter" is or how it's organized). Then the containerd code enforces the specific structure and organization that it wants for the snapshotter.

@ambarve ambarve force-pushed the cimfs branch 2 times, most recently from bc22f4c to eb2314b Compare October 17, 2023 23:14
@ambarve
Copy link
Contributor Author

ambarve commented Oct 17, 2023

In general, I think the interface between containerd and the hcsshim code (primarily the ociwclayer/cim package) should be less coupled and more explicit in how it functions.

For instance, rather than passing a path to a directory in the CimFS mount, and assuming that for directory <id> there is a cim at cim-layers/<id>.cim, I think it would be better to explicitly give the path to the cim file in the mount.

I hope that this can reduce coupling between the two packages, e.g. so that they no longer need to agree on a common directory structure with cim-layers. This will also make it easier if we want to reorganize the filesystem structure for the CimFS snapshotter, since it can change without needing to update hcsshim code.

The way I view this is the code in hcsshim should be written to be generic as to where it's being used (for instance, it doesn't need to know what a "snapshotter" is or how it's organized). Then the containerd code enforces the specific structure and organization that it wants for the snapshotter.

Discussed this with Kevin offline. We want to get initial CimFS changes merged before containerd 2.0 release. To avoid any further delays for that plans we will merge the CimFS changes without any further refactoring of the layer management code in hcsshim. However, as soon as these changes are merged, we will make additional changes to refactor layer management functions in hcsshim so that they allow a more flexible interface for different types of snapshotter Mounts when running containers.

@ambarve ambarve force-pushed the cimfs branch 8 times, most recently from 52f79cb to 175722d Compare November 7, 2023 22:28
Copy link
Member

@kevpar kevpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to get the code in to enable usage of CimFS in 2.0.

As discussed previously, I still believe more work is needed to better decouple the containerd snapshotter/differ from the implementation in hcsshim. Amit is planning on doing this work next, and I don't think this PR needs to be blocked on it. The decoupling work should be done before CimFS becomes the default snapshotter (which I guess would likely be for 2.1 at this point).

@k8s-ci-robot
Copy link

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Details about CimFs project are discussed in containerd#8346

Signed-off-by: Amit Barve <[email protected]>
@kevpar kevpar added this pull request to the merge queue Dec 22, 2023
Merged via the queue into containerd:main with commit 124bc0d Dec 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants