Currently for LCOW, for every container launched in the UVM we create a sandbox.vhdx on the host of whatever size is specified (default of 20GB) and mount it in as a SCSI disk to be used as the container's scratch space. I've done some work to make it possible for other containers to re-use the scratch space of a previously launched container, so they all share the same pool of available disk space. This work touches a lot of layers of our stack, so this issue will detail the currently proposed flow and will hopefully foster some discussion on how/if this could be handled better. The flow below assumes that the scratch being shared belongs to the Kubernetes pod sandbox container:
- User either passes in the annotation `containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch` on the pod config (and on any container configurations that will be used for containers launched in this pod), or sets `ShareScratch = true` in containerd.toml. A minimal sketch of the annotation route follows.
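For concreteness, here's a rough sketch of setting the opt-in annotation on a CRI pod sandbox config. The helper name and the `"true"` value are assumptions; the issue doesn't spell out what value the snapshotter expects.

```go
package main

import (
	runtime "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// Annotation key from this issue; the "true" value is an assumption.
const reuseScratchAnnotation = "containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch"

// podConfigWithSharedScratch is a hypothetical helper that opts a pod into
// scratch sharing; containers launched in the pod would carry the same
// annotation on their container configs.
func podConfigWithSharedScratch(name, namespace, uid string) *runtime.PodSandboxConfig {
	return &runtime.PodSandboxConfig{
		Metadata: &runtime.PodSandboxMetadata{
			Name:      name,
			Namespace: namespace,
			Uid:       uid,
		},
		Annotations: map[string]string{
			reuseScratchAnnotation: "true",
		},
	}
}
```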
- `RunPodSandbox` is called.
- In the CRI plugin, if the containerd.toml `ShareScratch` value or the annotation was set, the plugin itself sets two more annotations that are crucial for this to work: `containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch.container-type` and `containerd.io/snapshot/io.microsoft.sandbox.id`. The container-type annotation has only two values the snapshotter will understand: "sandbox" marks a sandbox container whose scratch space will be re-used for future containers, and "container" marks a container that will be sharing the scratch from a sandbox. I don't like how tied the naming here is to the Kubernetes concepts, so please share if anyone has better ideas. The second annotation, `sandbox.id`, will be used later on to help a container sharing this scratch space find where the sandbox's scratch is located. A rough sketch of this plugin logic follows.
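Something like the following, with illustrative names (`stampShareScratchAnnotations` is not the real plugin function, and the `"true"` check is an assumption about the annotation's value; the "sandbox"/"container" strings are from this issue):

```go
package cri

const (
	annoReuseScratch  = "containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch"
	annoContainerType = "containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch.container-type"
	annoSandboxID     = "containerd.io/snapshot/io.microsoft.sandbox.id"
)

// stampShareScratchAnnotations adds the two extra annotations when sharing
// was requested via containerd.toml or the pod annotation.
func stampShareScratchAnnotations(annotations map[string]string, shareScratchToml, isSandbox bool, sandboxID string) {
	if !shareScratchToml && annotations[annoReuseScratch] != "true" {
		return // sharing not requested
	}
	if isSandbox {
		// This container's scratch will be re-used by later containers.
		annotations[annoContainerType] = "sandbox"
	} else {
		// This container shares the sandbox's scratch.
		annotations[annoContainerType] = "container"
	}
	// Lets the snapshotter find the sandbox's scratch later on.
	annotations[annoSandboxID] = sandboxID
}
```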
- `CreateContainer` is called in `RunPodSandbox` and the CRI plugin passes all of the above annotations down to the snapshotter (they arrive there as snapshot labels; see the sketch below).
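Roughly like this; `prepareRWSnapshot` is a hypothetical wrapper, but `snapshots.WithLabels` is the real containerd helper for attaching labels to a snapshot:

```go
package cri

import (
	"context"

	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/snapshots"
)

// prepareRWSnapshot forwards the pod/container annotations to the
// snapshotter as labels on the r/w snapshot being prepared.
func prepareRWSnapshot(ctx context.Context, sn snapshots.Snapshotter, key, parent string, annotations map[string]string) ([]mount.Mount, error) {
	return sn.Prepare(ctx, key, parent, snapshots.WithLabels(annotations))
}
```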
- During the r/w snapshot creation, if the annotation isn't set (annotations are referred to as labels in the snapshotter, and anywhere else in containerd where a `map[string]string` of options is passed to a component), the usual LCOW sandbox.vhdx creation ensues and nothing changes. If it is set, the type of container is checked first to determine how to handle things. Let's continue with the scenario where this is the sandbox. In the sandbox case the usual scratch creation or copying occurs, except at the end we set a new label whose key is in the format `containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch.sandbox-%s`, where `%s` is filled in with the sandbox container's ID that we've passed in (the snapshot key for the final r/w snapshot will have this ID), and whose value is the newly created snapshot's path so we can find it later on. Once the label is added we write the snapshot info (which contains the labels) to the metadata file containerd uses for retrieval later on. Sandbox scratch snapshot creation is done at this point. I will skip the sandbox container creation/containerd-shim portions as nothing changes there. A rough sketch of the label stamping follows.
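A sketch of the sandbox branch with illustrative names (`recordSandboxScratch` is not the real snapshotter code):

```go
package lcow

import (
	"fmt"

	"github.com/containerd/containerd/snapshots"
)

const scratchLocationFmt = "containerd.io/snapshot/io.microsoft.container.storage.reuse-scratch.sandbox-%s"

// recordSandboxScratch stamps the location label onto the sandbox's
// snapshot info before it is persisted to the metadata store, so a later
// Stat can recover where the scratch vhdx lives.
func recordSandboxScratch(info *snapshots.Info, sandboxID, scratchPath string) {
	if info.Labels == nil {
		info.Labels = map[string]string{}
	}
	info.Labels[fmt.Sprintf(scratchLocationFmt, sandboxID)] = scratchPath
}
```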
- For any containers launched thereafter in the pod that specify they'd like to share scratch, the flow is a bit different. We create the snapshot directory as usual, but we don't copy or create a new scratch vhdx for it. We first look at the sandbox ID label to see which container's scratch we'd like to share. We then walk the list of committed snapshots in the containerd namespace we're in (`k8s.io` in our case) and look for a snapshot whose key contains the sandbox ID as a substring (the full key has the namespace and snapshot ID prepended, e.g. `k8s.io/5/`, so we need the full string to actually perform the next step). We take note of the key that matched and then use it to `Stat` the snapshot, which returns the information we saved to disk earlier. We use the label we wrote to disk, which contains the path to the sandbox container's scratch space, to create a symlink to the vhdx. Now we're finally done on the snapshotter side of things; a rough sketch follows.
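A sketch of the container branch. `linkToSandboxScratch` is a hypothetical helper and the `sandbox.vhdx` link name is an assumption; `scratchLocationFmt` is the same label key format as in the sandbox sketch above:

```go
package lcow

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/containerd/containerd/snapshots"
)

// linkToSandboxScratch finds the sandbox's scratch via the committed
// snapshots in the current namespace, then symlinks this snapshot's vhdx to
// it instead of creating or copying a new one.
func linkToSandboxScratch(ctx context.Context, sn snapshots.Snapshotter, sandboxID, snapshotDir string) error {
	var scratchPath string
	if err := sn.Walk(ctx, func(ctx context.Context, info snapshots.Info) error {
		// The full key has the namespace and snapshot ID prepended, so
		// substring-match the sandbox ID against it.
		if info.Kind != snapshots.KindCommitted || !strings.Contains(info.Name, sandboxID) {
			return nil
		}
		// Stat returns the info (labels included) we persisted when the
		// sandbox's scratch was created.
		full, err := sn.Stat(ctx, info.Name)
		if err != nil {
			return err
		}
		scratchPath = full.Labels[fmt.Sprintf(scratchLocationFmt, sandboxID)]
		return nil
	}); err != nil {
		return err
	}
	if scratchPath == "" {
		return fmt.Errorf("no shared scratch found for sandbox %s", sandboxID)
	}
	// Point this container's scratch at the sandbox's vhdx.
	return os.Symlink(scratchPath, filepath.Join(snapshotDir, "sandbox.vhdx"))
}
```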
- Misc. other `CreateContainer` things happen in containerd --> we call into the shim to start a task (container).
- On the shim side of things not too much is different. We read the layer folders as normal, but for the last layer, where the scratch is situated, we now evaluate any symlinks. As the symlink points to a previously mounted SCSI disk, we can just let the refcounting we already have for SCSI disks handle this. When we go to make the final `CombineLayers` request we run into a problem: we're about to pass in a path that we've already used to set up the overlay filesystem for a previous container (this path is where the upper and work directories for the overlay are set up). There are a couple of options we could take here, but I'll detail the road we'd likely want to head down. We can see that the SCSI mount object we get back from the `AddSCSI` call has a ref count higher than 1, which means we've already used it before and will need to pass in a different path in the scratch space to set up; a sketch of this follows. What format we'd like to go with here is TBD, and any suggestions are appreciated. We don't have access to the container's ID in `MountContainerLayers`, so maybe `container_n` where n is the refcount? The other option, which is not too ideal but works, is to check in opengcs whether the work and upper directories already exist in the scratch space and just append a random identifier to the end of the new directories to be made: Support for making multiple overlays in the same scratch path opengcs#383
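A sketch of the refcount-based option; `uniqueScratchPath` is a hypothetical helper, and the `container_n` format is just the suggestion from this issue (still TBD):

```go
package uvm

import (
	"fmt"
	"path/filepath"
)

// uniqueScratchPath picks where a container's overlay upper/work dirs live
// inside the scratch space. refCount would come from the SCSI mount object
// returned by AddSCSI.
func uniqueScratchPath(scratchMountPath string, refCount int) string {
	if refCount <= 1 {
		// First user of this scratch disk: keep today's behaviour and use
		// the scratch root directly.
		return scratchMountPath
	}
	// Shared scratch: carve out a per-container subdirectory so the
	// CombineLayers request doesn't reuse an overlay path that another
	// container has already set up.
	return filepath.Join(scratchMountPath, fmt.Sprintf("container_%d", refCount))
}
```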