Overview
Container layers are mounted as VPMem devices: each layer occupies an entire VPMem device, and we fall back to SCSI devices once we run out of VPMem slots.
This is generally fine for containers with few layers or for non-k8s scenarios where a single pod doesn't contain multiple containers; however, with this approach there's a hard limit on the number of layers we can mount.
As of the Fe (?) release, multiple storage devices can be mapped with offsets onto a single VPMem device. On the UVM side, the PMem devices will need to be handled by the Linux kernel's device-mapper: a single VPMem block device will be split into multiple linear targets, each representing a container layer. Those linear targets will then be mounted and, as usual, combined into a single union fs.
Given that LCOW layers are generally small, we should be able to pack multiple layers onto a single (large-ish) VPMem device. This will significantly bump the layer limit and thus the number of containers we can run in a single pod, and lets us avoid using slower (proof?) SCSI disks for read-only layers.
Approach
To accomplish the above, we'd need a couple of things:
- a memory management strategy
- reusing (when possible) an existing VPMem device to pack more LCOW layers
- new data structs, etc.
Memory management
This is something new that needs to be worked on. @kevpar suggested using something similar to buddy memory allocation, where we'd introduce multiple memory "classes" and make allocations based on the minimal class that can hold the layer.
The smallest class can be 1MB in size and the largest one will occupy an entire VPMem device (4GB in our case). The class step can be e.g. 2 bits, i.e. sizes go 1MB -> 4MB -> 16MB, etc. We could of course use a single-bit step as well (just like in the original algorithm), not a big deal here.
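As an illustration, a minimal sketch of the class sizing (assuming the 1MB smallest class, 2-bit step and a 4GB device; all names here are hypothetical, not the real implementation):

const (
	minClassSize uint64 = 1 << 20 // 1MB, the smallest class
	classShift          = 2       // 2-bit step: each class is 4x the previous one
	maxClass     uint32 = 6       // class 6 == 4GB, an entire VPMem device
)

// classSize returns the block size of a given class: 1MB, 4MB, 16MB, ...
func classSize(class uint32) uint64 {
	return minClassSize << (classShift * class)
}

// classForSize returns the smallest class whose block can hold size bytes,
// or false if the layer is too big for a single VPMem device.
func classForSize(size uint64) (uint32, bool) {
	for c := uint32(0); c <= maxClass; c++ {
		if classSize(c) >= size {
			return c, true
		}
	}
	return 0, false // caller would fall back to SCSI
}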
The algorithm is fairly simple:
- try to allocate a memory block of class N
- if there's no available slot, try to split the next class N+1 into 4, then try N+2, etc.
- if we don't find any free slot, fall back to SCSI
type slot struct {
class uint32
offset uint64
size uint64
next *slot
}
type allocator interface {
allocate(class uint32) (*slot, error)
release(class uint32, offset uint64) error
expand(class uint32) error
findFreeOffset(class uint32) (offset uint64, err error)
}
NOTE: sorry, naming is hard
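Purely as a sketch of the split-on-demand flow above (reusing classSize/maxClass from the earlier sketch, assuming a hypothetical vpmemAllocator that implements the interface, assuming expand(N) splits one free block of class N+1 into four free slots of class N, and reusing the ErrNoAvailableLocation sentinel from the snippet further down):

func (a *vpmemAllocator) allocate(class uint32) (*slot, error) {
	// Fast path: a free slot of the requested class already exists.
	// (A real implementation would also mark the returned slot as used.)
	if offset, err := a.findFreeOffset(class); err == nil {
		return &slot{class: class, offset: offset, size: classSize(class)}, nil
	}
	// Walk up the classes, find the first one with a free block and split it
	// back down into slots of the requested class.
	for c := class + 1; c <= maxClass; c++ {
		if _, err := a.findFreeOffset(c); err != nil {
			continue
		}
		for s := c; s > class; s-- {
			if err := a.expand(s - 1); err != nil {
				return nil, err
			}
		}
		offset, err := a.findFreeOffset(class)
		if err != nil {
			return nil, err
		}
		return &slot{class: class, offset: offset, size: classSize(class)}, nil
	}
	// Nothing left on this device; the caller falls back to SCSI.
	return nil, ErrNoAvailableLocation
}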
Data struct updates
Now that a single VPMem device can hold multiple container layers, we need to track that information somehow, e.g.
type vpmemMapping struct {
offset, size uint64
hostPath, uvmPath string
refCount uint32
}
type vpmemDevice struct {
memAlloc allocator
maxSize uint64
mappings map[string]*vpmemMapping
}
As an alternative, we could have
mappings map[uint64]*vpmemMapping
and drop offset from vpmemMapping, but it's easier to look up by layer path directly than to iterate over offset/mapping pairs.
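For illustration, a couple of hypothetical helpers (findMapping/addMapping are made-up names) on top of these structs, assuming mappings is keyed by the layer's host path:

// findMapping returns an existing mapping for hostPath on this device, if any,
// and bumps its refcount so the layer isn't torn down while still in use.
func (d *vpmemDevice) findMapping(hostPath string) (*vpmemMapping, bool) {
	m, ok := d.mappings[hostPath]
	if !ok {
		return nil, false
	}
	m.refCount++
	return m, true
}

// addMapping records a newly mapped layer at the given offset/size.
func (d *vpmemDevice) addMapping(hostPath, uvmPath string, offset, size uint64) *vpmemMapping {
	m := &vpmemMapping{offset: offset, size: size, hostPath: hostPath, uvmPath: uvmPath, refCount: 1}
	d.mappings[hostPath] = m
	return m
}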
Layer packing on host
As mentioned above, multiple VHDs can be mapped onto a single VPMem device. This is done by passing a device-mapping resource path, which consists of the device number and offset.
The first mount will still be a regular mount via the vpmem controller (but now that I think about it, maybe it's possible to give it the device-mapping resource path right away?); subsequent layer mounts will try to reuse already existing VPMem devices.
The flow is the same (a rough sketch follows this list):
- find layer
- if no layer found, find next VPMem that can hold the layer
- if VPMem is new, add it via controller API
- if VPMem is already added, map layer using mapping API
- if no VPMem found, use SCSI
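A rough sketch of that flow (every helper here, findVPMemMapping, findVPMemWithSpace, addVPMemDevice, mapVPMemLayer, addSCSILayer, is a hypothetical placeholder rather than an existing hcsshim API):

// addLayer sketches the host-side flow described above.
func (uvm *UtilityVM) addLayer(hostPath string, size uint64) (uvmPath string, err error) {
	// The layer may already be mapped on one of the VPMem devices.
	if m, ok := uvm.findVPMemMapping(hostPath); ok {
		m.refCount++
		return m.uvmPath, nil
	}
	// Find the next VPMem device whose allocator can hold the layer.
	dev, slot, isNew, err := uvm.findVPMemWithSpace(size)
	if err != nil {
		// No VPMem device has room left: fall back to SCSI.
		return uvm.addSCSILayer(hostPath)
	}
	if isNew {
		// Brand new device: add it via the vpmem controller API.
		return uvm.addVPMemDevice(dev, hostPath)
	}
	// Device is already attached: map the layer at the allocated offset via
	// the mapping API (device number + offset).
	return uvm.mapVPMemLayer(dev, hostPath, slot.offset, slot.size)
}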
device-mapper on guest
Related work to support this feature has been done here: microsoft/opengcs#389
In short, the opengcs APIs have been updated to also accept a MappingInfo as part of the VPMem mount request. If present, device-mapper will create a linear target, and that target will be mounted at /run/layers/pX-Y-Z, with X==device number, Y==device offset, Z==device size in bytes.
The offset/size values need to be page and block aligned.
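For illustration only (vpmemMountPath is a made-up helper and the 4KB page size is an assumption), the expected guest mount path and an alignment check could look like:

const pageSize uint64 = 4096 // assumed alignment requirement for offset/size

// vpmemMountPath returns the path the guest is expected to mount a mapped
// layer at: /run/layers/pX-Y-Z with X==device number, Y==offset, Z==size.
func vpmemMountPath(deviceNumber uint32, offset, size uint64) (string, error) {
	if offset%pageSize != 0 || size%pageSize != 0 {
		return "", fmt.Errorf("offset/size must be aligned: offset=%d size=%d", offset, size)
	}
	return fmt.Sprintf("/run/layers/p%d-%d-%d", deviceNumber, offset, size), nil
}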
UVM god object vs composition?
This part will be partially implemented to keep the changes relatively minimal.
The UtilityVM struct itself seems to be pretty overloaded and handles too much stuff (vpmem/scsi/vsmb/vpci, etc.), so maybe instead we could define "controller" and "device" interfaces and have vpmem/scsi/vsmb, etc. implement those interfaces.
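For example, the interfaces could look roughly like this (a hypothetical sketch; names, signatures and request types are assumptions, not settled API):

type device interface {
	// AddRefCount bumps the refcount on an already attached device/layer.
	AddRefCount()
	// HostComputeRequest builds the modify request for adding (or removing)
	// the device; the concrete request type is elided in this sketch.
	HostComputeRequest(requestType string) interface{}
}

type controller interface {
	// findDevice returns an already attached device holding the layer, if any.
	findDevice(layerPath string) device
	// findNextDevice picks a device that can hold the layer, returning
	// ErrNoAvailableLocation when this backend has no room left.
	findNextDevice(layerPath string) (device, error)
	// Save records the device/mapping once the modify request has succeeded.
	Save(dev device)
}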
From UVM's perspective then, we'd do something super generic:
for _, controller := range uvm.storageBackends {
	// Layer is already attached via this backend: just bump the refcount.
	if dev := controller.findDevice(layerPath); dev != nil {
		dev.AddRefCount()
		return
	}
	// Ask the backend for the next device that can hold the layer.
	dev, err := controller.findNextDevice(layerPath)
	if err == ErrNoAvailableLocation {
		// This backend is full, try the next one (e.g. SCSI after VPMem).
		continue
	}
	if err != nil {
		return // or propagate the error
	}
	request := dev.HostComputeRequest(requesttype.Add)
	if err = uvm.modify(request); err != nil {
		continue
	}
	controller.Save(dev)
	return
}
So, as part of this work, we can put some effort into decoupling, or at least into adding some higher-level interfaces, which may (or may not) make it possible to simplify certain things and make the UVM less godly.
The majority of work has been done in #930