Pack multiple layers onto a single Virtual PMem device #940

@anmaxvl

Overview

Container layers are mounted as VPMem devices: each layer occupies an entire VPMem device, and we fall back to SCSI devices once we run out of VPMem slots.
This is generally fine for containers with few layers, or for non-k8s scenarios where a single pod doesn't contain multiple containers; however, with this approach there's a hard limit on the number of layers we can mount.

As of the Fe (?) release, multiple storage devices can be mapped at offsets onto a single VPMem device. On the UVM side, the PMem devices will need to be handled by the Linux kernel's device-mapper: a single VPMem block device will be split into multiple linear targets, each representing a container layer. Those linear targets will then be mounted and, as usual, combined into a single union fs.

Given that LCOW layers are generally small, we should be able to pack multiple layers onto a single (large-ish) VPMem device. This will significantly bump the layer limit, and thus the number of containers we'll be able to run in a single pod, plus we can avoid using slower (proof?) SCSI disks for read-only layers.

Approach

To accomplish the above, we'd need a few things:

  • a memory management strategy
  • reuse (when possible) of existing VPMem devices to pack more LCOW layers
  • new data structures, etc.

Memory management

This is something new that needs to be worked on. @kevpar suggested using something similar to buddy memory allocation, where we'd introduce multiple memory "classes" and make allocations based on the minimal class that can hold the layer.
The smallest class can be 1MB and the largest one will occupy an entire VPMem device (4GB in our case). The step between classes can be e.g. 2 bits, i.e. sizes go 1MB -> 4MB -> 16MB etc. We could of course use a single-bit step as well (doubling, just like in the original algorithm), not a big deal here.
The algorithm is fairly simple:

  • try to allocate a memory block of class N
  • if there's no available slot, try splitting a block of the next class N+1 into 4, then N+2, etc.
  • if we don't find any free slot, fall back to SCSI

type slot struct {
    class  uint32 // allocation class this slot belongs to
    offset uint64 // offset on the VPMem device
    size   uint64 // slot size in bytes
    next   *slot  // next slot in the free list
}

type allocator interface {
    // allocate finds (or creates by splitting a larger block) a free slot of the given class.
    allocate(class uint32) (*slot, error)
    // release returns a previously allocated slot back to the free list.
    release(class uint32, offset uint64) error
    // expand adds free slots of this class by splitting a block of a higher class.
    expand(class uint32) error
    // findFreeOffset returns the offset of a free slot of the given class.
    findFreeOffset(class uint32) (offset uint64, err error)
}

NOTE: sorry, naming is hard
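
For illustration, here's a minimal sketch of what the allocation side could look like with the sizes above (1MB minimum class, 4GB device, x4 per class); the names and the free-list representation are made up for the example and are not a proposed final API:

package uvm // placement is illustrative

import "errors"

const (
    minClassSize  = 1 << 20 // class 0 == 1MB
    classStepBits = 2       // each class is 4x larger than the previous one
    vpmemSize     = 4 << 30 // 4GB VPMem device
)

// errNoFreeSlot signals the caller to fall back to SCSI.
var errNoFreeSlot = errors.New("no free slot on VPMem device")

type slotState struct {
    class  uint32
    offset uint64
    size   uint64
}

// vpmemAllocator keeps one free list per class for a single VPMem device.
type vpmemAllocator struct {
    free     map[uint32][]*slotState
    maxClass uint32
}

// classSize returns the block size of a class: 1MB, 4MB, 16MB, ...
func classSize(class uint32) uint64 {
    return minClassSize << (classStepBits * class)
}

func newVPMemAllocator() *vpmemAllocator {
    var top uint32
    for classSize(top) < vpmemSize {
        top++
    }
    a := &vpmemAllocator{free: map[uint32][]*slotState{}, maxClass: top}
    // Initially the entire device is a single free slot of the largest class.
    a.free[top] = []*slotState{{class: top, offset: 0, size: vpmemSize}}
    return a
}

// allocate hands out a slot of the requested class, splitting a block of the
// next class up into 4 children when this class has no free slots.
func (a *vpmemAllocator) allocate(class uint32) (*slotState, error) {
    if class > a.maxClass {
        return nil, errNoFreeSlot
    }
    if slots := a.free[class]; len(slots) > 0 {
        s := slots[len(slots)-1]
        a.free[class] = slots[:len(slots)-1]
        return s, nil
    }
    parent, err := a.allocate(class + 1)
    if err != nil {
        return nil, err
    }
    child := classSize(class)
    // Keep 3 of the 4 children on the free list and hand out the first one.
    for i := uint64(1); i < 4; i++ {
        a.free[class] = append(a.free[class], &slotState{class: class, offset: parent.offset + i*child, size: child})
    }
    return &slotState{class: class, offset: parent.offset, size: child}, nil
}

release (omitted here) would push the slot back onto its class's free list and, once all 4 siblings are free, merge them back into a single block of the parent class, which is where the "buddy" bookkeeping comes in.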

Data struct updates

Now that a single VPMem device can hold multiple container layers, we need to track that information somehow, e.g.

type vpmemMapping struct {
    offset, size      uint64 // location of the layer on the device
    hostPath, uvmPath string // layer VHD path on the host / mount path in the UVM
    refCount          uint32
}

type vpmemDevice struct {
    memAlloc allocator
    maxSize  uint64
    mappings map[string]*vpmemMapping // keyed by layer host path
}

As an alternative, we could have

    mappings map[uint64]*vpmemMapping

and not have offset as part of vpmemMapping, but it's easier to look up a layer by path directly than to iterate over offset/mapping pairs.
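
For illustration, the by-path lookup could then be as simple as this (findVPMemMapping and a vpmemDevices slice on UtilityVM are made-up names for the sketch, not existing fields):

// findVPMemMapping looks a layer up across all attached VPMem devices and
// bumps its refcount when it's already mapped somewhere.
func (uvm *UtilityVM) findVPMemMapping(hostPath string) (*vpmemMapping, bool) {
    for _, dev := range uvm.vpmemDevices {
        if m, ok := dev.mappings[hostPath]; ok {
            m.refCount++
            return m, true
        }
    }
    return nil, false
}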

Layer packing on host

As mentioned above, multiple VHDs can be mapped onto a single VPMem device; this is done by passing a device-mapping resource path, which consists of the device number and offset.
The first mount will still be a regular mount via the vpmem controller (though now that I think about it, maybe it's possible to give it the device-mapping resource path right away?); subsequent layer mounts will try to reuse already-attached VPMem devices.
The flow is the same (see the sketch below the list):

  • find the layer on an already-attached VPMem device
  • if the layer isn't found, find the next VPMem device that can hold it
    • if that VPMem device is new, add it via the controller API
    • if it's already attached, map the layer using the mapping API
  • if no VPMem device can hold the layer, use SCSI
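
A small sketch of the "find the next VPMem device that can hold the layer" step, written against the structs above; class is the minimal allocation class that fits the layer, and a nil result means the caller either hot-adds a brand new VPMem device (if slots remain) or falls back to SCSI:

// pickVPMemDevice returns the first already-attached device whose allocator
// can still fit a slot of the given class, along with the allocated slot.
func pickVPMemDevice(devices []*vpmemDevice, class uint32) (*vpmemDevice, *slot) {
    for _, dev := range devices {
        if s, err := dev.memAlloc.allocate(class); err == nil {
            return dev, s
        }
    }
    return nil, nil
}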

device-mapper on guest

Related work to support this feature has been done in microsoft/opengcs#389.
In short, the opengcs APIs have been updated to also accept MappingInfo as part of a VPMem mount request. If present, device-mapper will create a linear target, and that target will be mounted at /run/layers/pX-Y-Z, with X == device number, Y == device offset and Z == device size in bytes.
The offset/size values need to be page- and block-aligned.
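
To make the alignment requirement and the device-mapper setup concrete, here's a rough illustration; 4096 is assumed for the page/block size, the package/function names are made up, and the table string is just standard device-mapper "linear" syntax (512-byte sectors), not necessarily the exact string opengcs builds:

package vpmemlayout // illustrative package name

import "fmt"

const (
    pageSize   = 4096 // assumed page/block alignment; the real values come from the kernel/device
    sectorSize = 512  // device-mapper tables are expressed in 512-byte sectors
)

// alignUp rounds a layer size up to the alignment so an arbitrary-sized VHD
// still occupies an aligned region; with the 1MB minimum class above, slot
// offsets are already page-aligned by construction.
func alignUp(n uint64) uint64 {
    return (n + pageSize - 1) &^ (pageSize - 1)
}

// linearTable builds the device-mapper "linear" target table for one layer:
// "<start> <length> linear <device> <offset>", all values in sectors.
func linearTable(pmemDev string, offsetBytes, sizeBytes uint64) string {
    return fmt.Sprintf("0 %d linear %s %d", alignUp(sizeBytes)/sectorSize, pmemDev, offsetBytes/sectorSize)
}

For example, a 50MB layer at offset 0 on /dev/pmem0 would give the table "0 102400 linear /dev/pmem0 0", which, following the naming convention above, ends up mounted at /run/layers/p0-0-52428800.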

UVM god object vs composition?

This part will only be partially implemented, to keep the changes relatively minimal.
The UtilityVM struct itself seems to be pretty overloaded and handles too much stuff (vpmem/scsi/vsmb/vpci etc.), so instead we could define "controller" and "device" interfaces and have vpmem/scsi/vsmb etc. implement those interfaces.
From the UVM's perspective, we'd then do something super generic:

for _, controller := range uvm.storageBackends {
    dev := controller.findDevice(layerPath)
    if dev != nil {
        // layer is already backed by this controller, just bump the refcount
        dev.AddRefCount()
        return
    }
    dev, err := controller.findNextDevice(layerPath)
    if err == ErrNoAvailableLocation {
        // this backend is full, try the next one
        continue
    }
    request := dev.HostComputeRequest(requesttype.Add)
    if err := uvm.modify(request); err != nil {
        continue
    }
    controller.Save(dev)
    return
}

So as part of this work, we can put some effort into decoupling, or at least into adding some higher-level interfaces, which may (or may not) make it possible to simplify certain things and make uvm less godly.
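
For concreteness, the controller/device interfaces driving the loop above could look roughly like this; the names and method sets are illustrative only, nothing like this exists in the repo yet:

// storageDevice is a single attached device (a VPMem device, a SCSI attachment, ...).
type storageDevice interface {
    AddRefCount()
    // HostComputeRequest builds the modify request (requesttype.Add/Remove) to
    // attach or detach the device or one of its mappings; the concrete
    // request/return types are elided in this sketch.
    HostComputeRequest(requestType string) interface{}
}

// storageController owns the bookkeeping for one backend (vpmem, scsi, vsmb, ...).
type storageController interface {
    // findDevice returns an already-attached device backing layerPath, if any.
    findDevice(layerPath string) storageDevice
    // findNextDevice picks (or allocates space on) a device that can hold the
    // layer, returning ErrNoAvailableLocation when the backend is full.
    findNextDevice(layerPath string) (storageDevice, error)
    // Save records the device in the controller's state after a successful modify call.
    Save(dev storageDevice)
}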

The majority of work has been done in #930
