
Conversation

@yuanchu-xie
Contributor

I'm working on memory passthrough for lightweight VMs. We've come up with a guest-driven approach that proactively keeps the VM slim. Pvmemcontrol is the name of the device/driver that communicates between the guest and the VMM to control the host backing of guest memory.

Yuanchu Xie [email protected]
Pasha Tatashin [email protected] @soleen


Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
unbacked, or that sensitive data should not be swapped out.

Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the host memory that backs the guest
memory map. This is achieved by using the KVM_CAP_SYNC_MMU capability.
When this capability is available, changes in the backing of the memory
region on the host are automatically reflected into the guest. For
example, an mmap() or madvise() that affects the region will be made
visible immediately.

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers, and the
VMM device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request into the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.
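
To make the flow concrete, here is an illustrative VMM-side sketch of
the command dispatch. The type and field names are placeholders for
exposition, not the actual implementation in devices::pvmemcontrol:

use std::sync::RwLock;

// Placeholder: guest-physical address and command id negotiated for one
// per-cpu shared buffer.
struct PercpuBuf {
    gpa: u64,
    command: u32,
}

struct CommandDispatcher {
    bufs: RwLock<Vec<PercpuBuf>>,
}

impl CommandDispatcher {
    // Invoked on an MMIO write to the command register. The lock is
    // taken shared, so multiple vCPUs can issue commands concurrently.
    fn handle_command(&self, command: u32) {
        let bufs = self.bufs.read().unwrap();
        if let Some(buf) = bufs.iter().find(|b| b.command == command) {
            // Read the guest's request from guest memory at buf.gpa,
            // apply the matching host-side operation (e.g. madvise())
            // to the mapping backing the requested range, and write the
            // status back into the same buffer before returning.
            let _ = buf.gpa;
        }
    }
}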

User API
From userland, the pvmemcontrol guest driver is controlled via an
ioctl(2) call. It requires CAP_SYS_ADMIN.

ioctl(fd, PVMEMCONTROL_IOCTL, struct pvmemcontrol_buf *buf);

Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.

Supported function codes and their use cases:

PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: the guest can reduce struct
page and page table lookup overhead by using hugepages backed by smaller
pages on the host. These pvmemcontrol commands allow partial freeing of
private guest hugepages to save memory. They also allow kernel memory,
such as kernel stacks and task_structs, to be paravirtualized if we
expose kernel APIs.

PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
want to share its backing pages. PVMEMCONTROL_DONTDUMP works similarly,
ensuring sensitive pages are not included in a dump. MLOCK/UNLOCK can
advise the host that sensitive information should not be swapped out on
the host.

PVMEMCONTROL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages,
stack guard pages can be handled in the host and memory can be saved in
the hugepage.

PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
how guest memory is being mapped on the host.

Sample program making use of PVMEMCONTROL_DONTNEED:
https://github.com/Dummyc0m/pvmemcontrol-user
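
For illustration only, a condensed guest-side call might look like the
sketch below (using the libc crate). The device path, ioctl request
number, and buffer layout are placeholders; the authoritative
definitions live in the guest driver posting linked below:

use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

// Simplified stand-in for struct pvmemcontrol_buf; the real layout is
// defined by the guest driver's UAPI header.
#[repr(C)]
#[derive(Default)]
struct PvmemcontrolBuf {
    func_code: u64, // e.g. a hypothetical PVMEMCONTROL_DONTNEED value
    addr: u64,      // start of the guest address range to operate on
    length: u64,    // length of the range in bytes
    ret_value: u64, // status written back by the driver
}

fn main() -> std::io::Result<()> {
    const PVMEMCONTROL_IOCTL: libc::c_ulong = 0; // placeholder number

    // Requires CAP_SYS_ADMIN; the device node name is an assumption.
    let dev = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/pvmemcontrol")?;

    let mut buf = PvmemcontrolBuf::default();
    // Fill in func_code/addr/length for the range to release, then ask
    // the host to drop its backing.
    let rc = unsafe { libc::ioctl(dev.as_raw_fd(), PVMEMCONTROL_IOCTL, &mut buf) };
    if rc < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

The other function codes (e.g. MPROTECT_NONE for guard pages) follow
the same request pattern with different arguments.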

Previously posted RFC to cloud-hypervisor:
#6318

LKML posting of Linux guest driver:
https://lore.kernel.org/lkml/[email protected]/

@yuanchu-xie yuanchu-xie requested a review from a team as a code owner May 18, 2024 07:37
@up2wing
Contributor

up2wing commented May 20, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

gpio_device: Option<Arc<Mutex<devices::legacy::Gpio>>>,

pvmemcontrol_bus_device: Option<Arc<devices::pvmemcontrol::PvmemcontrolBusDevice>>,
pvmemcontrol_pci_device: Option<Arc<Mutex<devices::pvmemcontrol::PvmemcontrolPciDevice>>>,
Contributor

use devices::pvmemcontrol::{PvmemcontrolBusDevice, PvmemcontrolPciDevice};

can make this simpler.

id: String,
configuration: PciConfiguration,
bar_regions: Vec<PciBarConfiguration>,
}
Contributor

Would you like to explain why you need two structs to represent the device? In my
opinion, one struct, maybe PvmemcontrolDevice, seems like enough.

Contributor Author

Right. My observation was that both BusDevice and PciDevice handle device writes/reads, but only the BusDevice impl actually receives them. I want the device to handle requests on multiple CPUs at the same time, so I made the BusDeviceSync trait, similar to crosvm: it is just the BusDevice trait without the exclusive-reference requirement on the read and write trait methods, so the impl can handle its own locking and multiple read locks can be taken at the same time.

I left the PciDevice trait in place, so I need two structs, because the PciDevice gets wrapped in an Arc<Mutex<>> when I want a RwLock. On second thought, maybe I should instead refactor the Pci/BusDevice traits such that PciDevice also handles its own locking? That would be more consistent, but it would also inflate the PR into a tree-wide change.
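
A minimal sketch of the two trait shapes, to illustrate (signatures are
approximations of the ones in the tree, not exact):

use std::sync::Mutex;

// Dispatch with &mut self: the Bus wraps every device in a Mutex and
// takes it on each access, serializing all vCPUs.
pub trait BusDevice: Send {
    fn read(&mut self, base: u64, offset: u64, data: &mut [u8]);
    fn write(&mut self, base: u64, offset: u64, data: &[u8]);
}

// Dispatch with &self: the implementation chooses its own primitive
// (e.g. an RwLock), so multiple read locks can be held concurrently.
pub trait BusDeviceSync: Send + Sync {
    fn read(&self, base: u64, offset: u64, data: &mut [u8]);
    fn write(&self, base: u64, offset: u64, data: &[u8]);
}

// Existing Mutex-based devices keep working: locking is delegated to a
// BusDeviceSync impl for Mutex<T>.
impl<T: BusDevice> BusDeviceSync for Mutex<T> {
    fn read(&self, base: u64, offset: u64, data: &mut [u8]) {
        self.lock().unwrap().read(base, offset, data)
    }
    fn write(&self, base: u64, offset: u64, data: &[u8]) {
        self.lock().unwrap().write(base, offset, data)
    }
}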

@Dummyc0m

Dummyc0m commented Jun 7, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

By default the device is not enabled, and I would say this is roughly in the same ballpark as virtio-balloon reporting free pages for the host to madvise away. Would you say that the device should be feature gated?

@Dummyc0m Dummyc0m force-pushed the memctl-pci branch 2 times, most recently from 6b4c78f to d56ecdf Compare June 7, 2024 23:28

@liuw
Member

liuw commented Jun 19, 2024

A few comments:

  1. I think this should be gated by a flag and disabled by default, because the kernel code is not yet upstreamed.
  2. I think you should remove the reference to the prototype in your commit message.
  3. The device is really simple, and the code is self-contained, so I don't worry about it being overly buggy or anything. I can only speak for myself, but I'm happy to merge experimental code like this to nurture innovation.

I know there is a chicken-and-egg problem. Kernel wants to have some users before merging new code, while user space programs are hesitant to take in new code because kernel code can still change. Having the feature merged but disabled by default seems like a good way forward.

Lastly, I know it is not possible to test this right now, but if we merge this, please plan to add a test case when the kernel changes are merged.

@Dummyc0m

Thanks Liu Wei, I agree on all three remarks, plus testing when the kernel changes are merged. Let me make the changes.

@Dummyc0m Dummyc0m force-pushed the memctl-pci branch 2 times, most recently from ff253bf to de7e144 Compare June 25, 2024 00:55
@Dummyc0m

Seems like I missed a few things. Let me actually add the pre-commit hooks to my local setup so I don't forget to run the checks every time.

@liuw liuw closed this Jul 9, 2024
@liuw liuw reopened this Jul 9, 2024
liuw
liuw previously approved these changes Jul 9, 2024
Member

@liuw liuw left a comment

Some minor comments below.

@liuw
Member

liuw commented Jul 22, 2024

@novakovic please don't push to the existing branch like that. The top commit you pushed is not signed off. It looks like you're making a minor change in numbering. Your patch should be folded into the existing one.

@Dummyc0m

@novakovic please don't push to the existing branch like that. The top commit you pushed is not signed off. It looks like you're making a minor change in numbering. Your patch should be folded into the existing one.

Thank you so much for the pointer, Wei. I will be folding this change in.

@rbradford rbradford dismissed liuw’s stale review July 23, 2024 10:47

PR not in mergeable state.

BusDevice trait functions currently hold a mutable reference to self,
and exclusive access is guaranteed by taking a Mutex when dispatched by
the Bus object. However, this prevents individual devices from serving
accesses that do not require a mutable reference or are better served
with different synchronization primitives. We switch Bus to dispatch via
BusDeviceSync, which holds a shared reference, and delegate locking to
the BusDeviceSync trait implementation for Mutex<BusDevice>.

Other changes are made to make use of the dyn BusDeviceSync
trait object.

Signed-off-by: Yuanchu Xie <[email protected]>
The BusDevice requirement is not needed; only Send is required.

Signed-off-by: Yuanchu Xie <[email protected]>
Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
unbacked, or that sensitive data should not be swapped out.

Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the host memory that backs the guest
memory map. This is achieved by using the KVM_CAP_SYNC_MMU capability.
When this capability is available, changes in the backing of the memory
region on the host are automatically reflected into the guest. For
example, an mmap() or madvise() that affects the region will be made
visible immediately.

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers, and the
VMM device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request into the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.

The Cloud Hypervisor component can be enabled with --pvmemcontrol.

Co-developed-by: Stanko Novakovic <[email protected]>
Co-developed-by: Pasha Tatashin <[email protected]>
Signed-off-by: Yuanchu Xie <[email protected]>
@Dummyc0m

Changelog:
Folded @novakovic's change
Incorporated Wei's review comments
Rebased on top of main
Re-tested

Member

@rbradford rbradford left a comment

Please can you add some build testing for this feature in the CI so that it doesn't bitrot. Otherwise lgtm.

@liuw
Member

liuw commented Aug 5, 2024

I have a small patch to add a new build test. I can post that once this is merged.

@liuw liuw added this pull request to the merge queue Aug 5, 2024
@liuw liuw removed this pull request from the merge queue due to a manual request Aug 5, 2024
@liuw liuw added this pull request to the merge queue Aug 5, 2024
Merged via the queue into cloud-hypervisor:main with commit 5f18ac3 Aug 5, 2024