Pack multiple layers onto a single Virtual PMem device #940

@anmaxvl

Overview

Container layers are mounted as VPMem devices: each layer occupies an entire VPMem device, and we fall back to SCSI devices once we run out of VPMem slots.
This is generally fine for containers with few layers, or for non-k8s scenarios where a single pod doesn't contain multiple containers; however, with this approach there's a hard limit on the number of layers we can mount.

As of the Fe (?) release, multiple storage devices can be mapped at offsets onto a single VPMem device. On the UVM side, the PMem devices will need to be handled by the Linux kernel's device-mapper: a single VPMem block device will be split into multiple linear targets, each representing a container layer. Those linear targets will then be mounted and, as usual, combined into a single union fs.

Given that LCOW layers are generally small, we should be able to pack multiple layers onto a single (large-ish) VPMem device. This will significantly bump the layer limit, and thus the number of containers we'll be able to run in a single pod, plus we can avoid using slower (proof?) SCSI disks for read-only layers.

Approach

To accomplish the above, we'd need a few things:

  • a memory management strategy
  • reuse (when possible) of existing VPMem devices to pack more LCOW layers
  • new data structures, etc.

Memory management

This is something new that needs to be worked on. @kevpar suggested using something similar to buddy memory allocation, where we'd introduce multiple memory "classes" and make allocations based on the minimal class that can hold the layer.
The smallest class can be 1MB and the largest one will occupy an entire VPMem device (4GB in our case). The step between classes can be e.g. 2 bits, i.e. sizes go 1MB -> 4MB -> 16MB etc. We could of course use a single-bit step as well (doubling, just like in the original algorithm), not a big deal here.
The algorithm is fairly simple:

  • try to allocate a memory block of class N
  • if there's no available slot, try splitting a block of the next class N+1 into 4, then N+2, etc.
  • if we don't find any free slot, fall back to SCSI

type slot struct {
    class  uint32 // allocation class this slot belongs to
    offset uint64 // offset on the VPMem device
    size   uint64 // slot size in bytes
    next   *slot  // next slot in the free list
}

type allocator interface {
    // allocate finds (or creates by splitting a larger block) a free slot of the given class.
    allocate(class uint32) (*slot, error)
    // release returns a previously allocated slot back to the free list.
    release(class uint32, offset uint64) error
    // expand adds free slots of this class by splitting a block of a higher class.
    expand(class uint32) error
    // findFreeOffset returns the offset of a free slot of the given class.
    findFreeOffset(class uint32) (offset uint64, err error)
}

NOTE: sorry, naming is hard
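
For illustration, here's a minimal sketch of what the allocation side could look like with the sizes above (1MB minimum class, 4GB device, x4 per class); the names and the free-list representation are made up for the example and are not a proposed final API:

package uvm // placement is illustrative

import "errors"

const (
    minClassSize  = 1 << 20 // class 0 == 1MB
    classStepBits = 2       // each class is 4x larger than the previous one
    vpmemSize     = 4 << 30 // 4GB VPMem device
)

// errNoFreeSlot signals the caller to fall back to SCSI.
var errNoFreeSlot = errors.New("no free slot on VPMem device")

type slotState struct {
    class  uint32
    offset uint64
    size   uint64
}

// vpmemAllocator keeps one free list per class for a single VPMem device.
type vpmemAllocator struct {
    free     map[uint32][]*slotState
    maxClass uint32
}

// classSize returns the block size of a class: 1MB, 4MB, 16MB, ...
func classSize(class uint32) uint64 {
    return minClassSize << (classStepBits * class)
}

func newVPMemAllocator() *vpmemAllocator {
    var top uint32
    for classSize(top) < vpmemSize {
        top++
    }
    a := &vpmemAllocator{free: map[uint32][]*slotState{}, maxClass: top}
    // Initially the entire device is a single free slot of the largest class.
    a.free[top] = []*slotState{{class: top, offset: 0, size: vpmemSize}}
    return a
}

// allocate hands out a slot of the requested class, splitting a block of the
// next class up into 4 children when this class has no free slots.
func (a *vpmemAllocator) allocate(class uint32) (*slotState, error) {
    if class > a.maxClass {
        return nil, errNoFreeSlot
    }
    if slots := a.free[class]; len(slots) > 0 {
        s := slots[len(slots)-1]
        a.free[class] = slots[:len(slots)-1]
        return s, nil
    }
    parent, err := a.allocate(class + 1)
    if err != nil {
        return nil, err
    }
    child := classSize(class)
    // Keep 3 of the 4 children on the free list and hand out the first one.
    for i := uint64(1); i < 4; i++ {
        a.free[class] = append(a.free[class], &slotState{class: class, offset: parent.offset + i*child, size: child})
    }
    return &slotState{class: class, offset: parent.offset, size: child}, nil
}

release (omitted here) would push the slot back onto its class's free list and, once all 4 siblings are free, merge them back into a single block of the parent class, which is where the "buddy" bookkeeping comes in.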

Data struct updates

Now that a single VPMem device can hold multiple container layers, we need to track that information somehow, e.g.

type vpmemMapping struct {
    offset, size      uint64 // location of the layer on the device
    hostPath, uvmPath string // layer VHD path on the host / mount path in the UVM
    refCount          uint32
}

type vpmemDevice struct {
    memAlloc allocator
    maxSize  uint64
    mappings map[string]*vpmemMapping // keyed by layer host path
}

As an alternative, we could have

    mappings map[uint64]*vpmemMapping

and not have offset as part of vpmemMapping, but it's easier to look up a layer by path directly than to iterate over offset/mapping pairs.
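
For illustration, the by-path lookup could then be as simple as this (findVPMemMapping and a vpmemDevices slice on UtilityVM are made-up names for the sketch, not existing fields):

// findVPMemMapping looks a layer up across all attached VPMem devices and
// bumps its refcount when it's already mapped somewhere.
func (uvm *UtilityVM) findVPMemMapping(hostPath string) (*vpmemMapping, bool) {
    for _, dev := range uvm.vpmemDevices {
        if m, ok := dev.mappings[hostPath]; ok {
            m.refCount++
            return m, true
        }
    }
    return nil, false
}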

Layer packing on host

As mentioned above, multiple VHDs can be mapped onto a single VPMem device; this is done by passing a device-mapping resource path, which consists of the device number and offset.
The first mount will still be a regular mount via the vpmem controller (though now that I think about it, maybe it's possible to give it the device-mapping resource path right away?); subsequent layer mounts will try to reuse already-attached VPMem devices.
The flow is the same (see the sketch below the list):

  • find the layer on an already-attached VPMem device
  • if the layer isn't found, find the next VPMem device that can hold it
    • if that VPMem device is new, add it via the controller API
    • if it's already attached, map the layer using the mapping API
  • if no VPMem device can hold the layer, use SCSI
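
A small sketch of the "find the next VPMem device that can hold the layer" step, written against the structs above; class is the minimal allocation class that fits the layer, and a nil result means the caller either hot-adds a brand new VPMem device (if slots remain) or falls back to SCSI:

// pickVPMemDevice returns the first already-attached device whose allocator
// can still fit a slot of the given class, along with the allocated slot.
func pickVPMemDevice(devices []*vpmemDevice, class uint32) (*vpmemDevice, *slot) {
    for _, dev := range devices {
        if s, err := dev.memAlloc.allocate(class); err == nil {
            return dev, s
        }
    }
    return nil, nil
}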

device-mapper on guest

Related work to support this feature has been done in microsoft/opengcs#389.
In short, the opengcs APIs have been updated to also accept MappingInfo as part of a VPMem mount request. If present, device-mapper will create a linear target, and that target will be mounted at /run/layers/pX-Y-Z, with X == device number, Y == device offset and Z == device size in bytes.
The offset/size values need to be page- and block-aligned.
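
To make the alignment requirement and the device-mapper setup concrete, here's a rough illustration; 4096 is assumed for the page/block size, the package/function names are made up, and the table string is just standard device-mapper "linear" syntax (512-byte sectors), not necessarily the exact string opengcs builds:

package vpmemlayout // illustrative package name

import "fmt"

const (
    pageSize   = 4096 // assumed page/block alignment; the real values come from the kernel/device
    sectorSize = 512  // device-mapper tables are expressed in 512-byte sectors
)

// alignUp rounds a layer size up to the alignment so an arbitrary-sized VHD
// still occupies an aligned region; with the 1MB minimum class above, slot
// offsets are already page-aligned by construction.
func alignUp(n uint64) uint64 {
    return (n + pageSize - 1) &^ (pageSize - 1)
}

// linearTable builds the device-mapper "linear" target table for one layer:
// "<start> <length> linear <device> <offset>", all values in sectors.
func linearTable(pmemDev string, offsetBytes, sizeBytes uint64) string {
    return fmt.Sprintf("0 %d linear %s %d", alignUp(sizeBytes)/sectorSize, pmemDev, offsetBytes/sectorSize)
}

For example, a 50MB layer at offset 0 on /dev/pmem0 would give the table "0 102400 linear /dev/pmem0 0", which, following the naming convention above, ends up mounted at /run/layers/p0-0-52428800.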

UVM god object vs composition?

This part will only be partially implemented, to keep the changes relatively minimal.
The UtilityVM struct itself seems to be pretty overloaded and handles too much stuff (vpmem/scsi/vsmb/vpci etc.), so instead we could define "controller" and "device" interfaces and have vpmem/scsi/vsmb etc. implement those interfaces.
From the UVM's perspective, we'd then do something super generic:

for _, controller := range uvm.storageBackends {
    dev := controller.findDevice(layerPath)
    if dev != nil {
        // layer is already backed by this controller, just bump the refcount
        dev.AddRefCount()
        return
    }
    dev, err := controller.findNextDevice(layerPath)
    if err == ErrNoAvailableLocation {
        // this backend is full, try the next one
        continue
    }
    request := dev.HostComputeRequest(requesttype.Add)
    if err := uvm.modify(request); err != nil {
        continue
    }
    controller.Save(dev)
    return
}

So as part of this work, we can put some effort into decoupling, or at least into adding some higher-level interfaces, which may (or may not) make it possible to simplify certain things and make uvm less godly.
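
For concreteness, the controller/device interfaces driving the loop above could look roughly like this; the names and method sets are illustrative only, nothing like this exists in the repo yet:

// storageDevice is a single attached device (a VPMem device, a SCSI attachment, ...).
type storageDevice interface {
    AddRefCount()
    // HostComputeRequest builds the modify request (requesttype.Add/Remove) to
    // attach or detach the device or one of its mappings; the concrete
    // request/return types are elided in this sketch.
    HostComputeRequest(requestType string) interface{}
}

// storageController owns the bookkeeping for one backend (vpmem, scsi, vsmb, ...).
type storageController interface {
    // findDevice returns an already-attached device backing layerPath, if any.
    findDevice(layerPath string) storageDevice
    // findNextDevice picks (or allocates space on) a device that can hold the
    // layer, returning ErrNoAvailableLocation when the backend is full.
    findNextDevice(layerPath string) (storageDevice, error)
    // Save records the device in the controller's state after a successful modify call.
    Save(dev storageDevice)
}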

The majority of work has been done in #930
