
Conversation

@yuanchu-xie
Contributor

I'm working on memory passthrough for lightweight VMs. We've come up with a guest-driven approach that proactively keeps the VM slim. Pvmemcontrol is the name of the device/driver that communicates between the guest and the VMM to control the host backing of guest memory.

Yuanchu Xie [email protected]
Pasha Tatashin [email protected] @soleen


Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
unbacked, or that sensitive data should not be swapped out.

Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the host memory that backs the guest
memory map. This is achieved by using the KVM_CAP_SYNC_MMU capability.
When this capability is available, changes in the backing of the memory
region on the host are automatically reflected into the guest. For
example, an mmap() or madvise() that affects the region will be made
visible immediately.

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers, and the
VMM device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request into the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.
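
To make the flow concrete, here is an illustrative VMM-side sketch of
the command dispatch. The type and field names are placeholders for
exposition, not the actual implementation in devices::pvmemcontrol:

use std::sync::RwLock;

// Placeholder: guest-physical address and command id negotiated for one
// per-cpu shared buffer.
struct PercpuBuf {
    gpa: u64,
    command: u32,
}

struct CommandDispatcher {
    bufs: RwLock<Vec<PercpuBuf>>,
}

impl CommandDispatcher {
    // Invoked on an MMIO write to the command register. The lock is
    // taken shared, so multiple vCPUs can issue commands concurrently.
    fn handle_command(&self, command: u32) {
        let bufs = self.bufs.read().unwrap();
        if let Some(buf) = bufs.iter().find(|b| b.command == command) {
            // Read the guest's request from guest memory at buf.gpa,
            // apply the matching host-side operation (e.g. madvise())
            // to the mapping backing the requested range, and write the
            // status back into the same buffer before returning.
            let _ = buf.gpa;
        }
    }
}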

User API
From userland, the pvmemcontrol guest driver is controlled via an
ioctl(2) call. It requires CAP_SYS_ADMIN.

ioctl(fd, PVMEMCONTROL_IOCTL, struct pvmemcontrol_buf *buf);

Guest userland applications can tag VMAs and guest hugepages, or advise
the host on how to handle sensitive guest pages.

Supported function codes and their use cases:

PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: the guest can reduce struct
page and page table lookup overhead by using hugepages backed by smaller
pages on the host. These pvmemcontrol commands allow partial freeing of
private guest hugepages to save memory. They also allow kernel memory,
such as kernel stacks and task_structs, to be paravirtualized if we
expose kernel APIs.

PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
want to share its backing pages. PVMEMCONTROL_DONTDUMP works similarly,
ensuring sensitive pages are not included in a dump. MLOCK/UNLOCK can
advise the host that sensitive information should not be swapped out on
the host.

PVMEMCONTROL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages,
stack guard pages can be handled in the host and memory can be saved in
the hugepage.

PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
how guest memory is being mapped on the host.

Sample program making use of PVMEMCONTROL_DONTNEED:
https://github.com/Dummyc0m/pvmemcontrol-user
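
For illustration only, a condensed guest-side call might look like the
sketch below (using the libc crate). The device path, ioctl request
number, and buffer layout are placeholders; the authoritative
definitions live in the guest driver posting linked below:

use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

// Simplified stand-in for struct pvmemcontrol_buf; the real layout is
// defined by the guest driver's UAPI header.
#[repr(C)]
#[derive(Default)]
struct PvmemcontrolBuf {
    func_code: u64, // e.g. a hypothetical PVMEMCONTROL_DONTNEED value
    addr: u64,      // start of the guest address range to operate on
    length: u64,    // length of the range in bytes
    ret_value: u64, // status written back by the driver
}

fn main() -> std::io::Result<()> {
    const PVMEMCONTROL_IOCTL: libc::c_ulong = 0; // placeholder number

    // Requires CAP_SYS_ADMIN; the device node name is an assumption.
    let dev = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/pvmemcontrol")?;

    let mut buf = PvmemcontrolBuf::default();
    // Fill in func_code/addr/length for the range to release, then ask
    // the host to drop its backing.
    let rc = unsafe { libc::ioctl(dev.as_raw_fd(), PVMEMCONTROL_IOCTL, &mut buf) };
    if rc < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

The other function codes (e.g. MPROTECT_NONE for guard pages) follow
the same request pattern with different arguments.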

Previously posted RFC to cloud-hypervisor:
#6318

LKML posting of Linux guest driver:
https://lore.kernel.org/lkml/[email protected]/

@yuanchu-xie yuanchu-xie requested a review from a team as a code owner May 18, 2024 07:37
@up2wing
Contributor

up2wing commented May 20, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

gpio_device: Option<Arc<Mutex<devices::legacy::Gpio>>>,

pvmemcontrol_bus_device: Option<Arc<devices::pvmemcontrol::PvmemcontrolBusDevice>>,
pvmemcontrol_pci_device: Option<Arc<Mutex<devices::pvmemcontrol::PvmemcontrolPciDevice>>>,
Contributor

use devices::pvmemcontrol::{PvmemcontrolBusDevice, PvmemcontrolPciDevice};

can make this simpler.

id: String,
configuration: PciConfiguration,
bar_regions: Vec<PciBarConfiguration>,
}
Contributor

Would you like to explain why you need two structs to represent the device? In my
opinion, one struct, maybe PvmemcontrolDevice, seems like enough.

Contributor Author

Right. My observation was that both BusDevice and PciDevice handle device writes/reads, but only the BusDevice impl actually receives them. I want the device to handle requests on multiple CPUs at the same time, so I made the BusDeviceSync trait, similar to crosvm: it is just the BusDevice trait without the exclusive-reference requirement on the read and write trait methods, so the impl can handle its own locking and multiple read locks can be taken at the same time.

I left the PciDevice trait in place, so I need two structs, because the PciDevice gets wrapped in an Arc<Mutex<>> when I want a RwLock. On second thought, maybe I should instead refactor the Pci/BusDevice traits such that PciDevice also handles its own locking? That would be more consistent, but it would also inflate the PR into a tree-wide change.
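
A minimal sketch of the two trait shapes, to illustrate (signatures are
approximations of the ones in the tree, not exact):

use std::sync::Mutex;

// Dispatch with &mut self: the Bus wraps every device in a Mutex and
// takes it on each access, serializing all vCPUs.
pub trait BusDevice: Send {
    fn read(&mut self, base: u64, offset: u64, data: &mut [u8]);
    fn write(&mut self, base: u64, offset: u64, data: &[u8]);
}

// Dispatch with &self: the implementation chooses its own primitive
// (e.g. an RwLock), so multiple read locks can be held concurrently.
pub trait BusDeviceSync: Send + Sync {
    fn read(&self, base: u64, offset: u64, data: &mut [u8]);
    fn write(&self, base: u64, offset: u64, data: &[u8]);
}

// Existing Mutex-based devices keep working: locking is delegated to a
// BusDeviceSync impl for Mutex<T>.
impl<T: BusDevice> BusDeviceSync for Mutex<T> {
    fn read(&self, base: u64, offset: u64, data: &mut [u8]) {
        self.lock().unwrap().read(base, offset, data)
    }
    fn write(&self, base: u64, offset: u64, data: &[u8]) {
        self.lock().unwrap().write(base, offset, data)
    }
}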

@Dummyc0m

Dummyc0m commented Jun 7, 2024

If I understand correctly, the guest can change/operate the host memory properties using the pvmemcontrol device, which I think may worry some public cloud users. So it might be better to add a feature like guest_debug to control this.

By default the device is not enabled, and I would say this is roughly in the same ballpark as virtio-balloon reporting free pages for the host to madvise away. Would you say that the device should be feature gated?

@Dummyc0m Dummyc0m force-pushed the memctl-pci branch 2 times, most recently from 6b4c78f to d56ecdf Compare June 7, 2024 23:28

@liuw
Member

liuw commented Jun 19, 2024

A few comments:

  1. I think this should be gated by a flag and disabled by default, because the kernel code is not yet upstreamed.
  2. I think you should remove the reference to the prototype in your commit message.
  3. The device is really simple, and the code is self-contained, so I don't worry about it being overly buggy or anything. I can only speak for myself, but I'm happy to merge experimental code like this to nurture innovation.

I know there is a chicken-and-egg problem. Kernel wants to have some users before merging new code, while user space programs are hesitant to take in new code because kernel code can still change. Having the feature merged but disabled by default seems like a good way forward.

Lastly, I know it is not possible to test this right now, but if we merge this, please plan to add a test case when the kernel changes are merged.

@Dummyc0m

Thanks Liu Wei, I agree on all three remarks, plus testing when the kernel changes are merged. Let me make the changes.

@Dummyc0m Dummyc0m force-pushed the memctl-pci branch 2 times, most recently from ff253bf to de7e144 Compare June 25, 2024 00:55
@Dummyc0m

Seems like I missed a few things. Let me actually add the pre-commit hooks to my local setup so I don't forget to run the checks every time.

@liuw liuw closed this Jul 9, 2024
@liuw liuw reopened this Jul 9, 2024
liuw
liuw previously approved these changes Jul 9, 2024
Member

@liuw liuw left a comment

Some minor comments below.

@liuw
Member

liuw commented Jul 22, 2024

@novakovic please don't push to the existing branch like that. The top commit you pushed is not signed off. It looks like you're making a minor change in numbering. Your patch should be folded into the existing one.

@Dummyc0m

@novakovic please don't push to the existing branch like that. The top commit you pushed is not signed off. It looks like you're making a minor change in numbering. Your patch should be folded into the existing one.

Thank you so much for the pointer, Wei. I will be folding this change in.

@rbradford rbradford dismissed liuw’s stale review July 23, 2024 10:47

PR not in mergeable state.

BusDevice trait functions currently hold a mutable reference to self,
and exclusive access is guaranteed by taking a Mutex when dispatched by
the Bus object. However, this prevents individual devices from serving
accesses that do not require a mutable reference or are better served
with different synchronization primitives. We switch Bus to dispatch via
BusDeviceSync, which holds a shared reference, and delegate locking to
the BusDeviceSync trait implementation for Mutex<BusDevice>.

Other changes are made to make use of the dyn BusDeviceSync
trait object.

Signed-off-by: Yuanchu Xie <[email protected]>
The BusDevice requirement is not needed; only Send is required.

Signed-off-by: Yuanchu Xie <[email protected]>
Pvmemcontrol provides a way for the guest to control its physical memory
properties, and enables optimizations and security features. For
example, the guest can inform the host that parts of a hugepage may be
unbacked, or that sensitive data should not be swapped out.

Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
and also some other properties of the host memory that backs the guest
memory map. This is achieved by using the KVM_CAP_SYNC_MMU capability.
When this capability is available, changes in the backing of the memory
region on the host are automatically reflected into the guest. For
example, an mmap() or madvise() that affects the region will be made
visible immediately.

There are two components of the implementation: the guest Linux driver
and the Virtual Machine Monitor (VMM) device. A guest-allocated shared
buffer is negotiated per-cpu through a few PCI MMIO registers, and the
VMM device assigns a unique command for each per-cpu buffer. The guest
writes its pvmemcontrol request into the per-cpu buffer, then writes the
corresponding command into the command register, calling into the VMM
device to perform the pvmemcontrol request.

The synchronous per-cpu shared buffer approach avoids the kick and busy
waiting that the guest would have to do with virtio virtqueue transport.

The Cloud Hypervisor component can be enabled with --pvmemcontrol.

Co-developed-by: Stanko Novakovic <[email protected]>
Co-developed-by: Pasha Tatashin <[email protected]>
Signed-off-by: Yuanchu Xie <[email protected]>
@Dummyc0m

Changelog:
Folded @novakovic's change
Incorporated Wei's review comments
Rebased on top of main
Re-tested

Member

@rbradford rbradford left a comment

Please can you add some build testing for this feature in the CI so that it doesn't bitrot. Otherwise lgtm.

@liuw
Member

liuw commented Aug 5, 2024

I have a small patch to add a new build test. I can post that once this is merged.

@liuw liuw added this pull request to the merge queue Aug 5, 2024
@liuw liuw removed this pull request from the merge queue due to a manual request Aug 5, 2024
@liuw liuw added this pull request to the merge queue Aug 5, 2024
Merged via the queue into cloud-hypervisor:main with commit 5f18ac3 Aug 5, 2024