Nested Virtualization in Intel Systems

Muli Ben-Yehuda†  Michael D. Day‡  Zvi Dubitzky†  Michael Factor†  Nadav Har’El†
[email protected]  [email protected]  [email protected]  [email protected]  [email protected]

Abel Gordon†  Anthony Liguori‡  Orit Wasserman†  Ben-Ami Yassour†
[email protected]  [email protected]  [email protected]  [email protected]

† IBM Research – Haifa    ‡ IBM Linux Technology Center
our approach allows both hypervisors to take advantage
of the virtualization hardware, leading to a more efficient
implementation.
…decides whether that trap should be handled by L0 (e.g., because the trapping context was L1) or whether to forward it to the responsible hypervisor (e.g., because the trap occurred in L2 and should be handled by L1). In the latter case, L0 forwards the trap to L1 for handling.

…the processor, L0 uses it to emulate a VMX-enabled CPU for L1.
When there are n levels of nesting guests, but the hard-
ware supports less than n levels of MMU or DMA trans-
lation tables, the n levels need to be compressed onto the
levels available in hardware, as described in Sections 3.3
and 3.4.
…VMCS1→2 to construct a new VMCS (VMCS0→2) that holds L2's environment from L0's perspective.

L0 must consider all the specifications defined in VMCS1→2 and also the specifications defined in VMCS0→1 to create VMCS0→2. The host state defined in VMCS0→2 must contain the values required by the CPU to correctly switch back from L2 to L0. In addition, the VMCS1→2 host state must be copied to the VMCS0→1 guest state. Thus, when L0 emulates a switch from L2 to L1, the processor loads the correct L1 specifications.

The guest state stored in VMCS1→2 does not require any special handling in general, and most fields can be copied directly to the guest state of VMCS0→2.

The control data of VMCS1→2 and VMCS0→1 must be merged to correctly emulate the processor behavior. For example, consider the case where L1 specifies to trap an event EA in VMCS1→2 but L0 does not trap such an event for L1 (i.e., a trap is not specified in VMCS0→1). To forward the event EA to L1, L0 needs to specify the corresponding trap in VMCS0→2. In addition, the field used by L1 to inject events to L2 needs to be merged, as well as the fields used by the processor to specify the exit cause.

For the sake of brevity, we omit some details on how specific VMCS fields are merged. For the complete details, the interested reader is encouraged to refer to the KVM source code [29].
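To make the merge concrete, the following is a minimal sketch in C of the kind of control- and state-field merging described above. It is illustrative only: the structure layout, the field names, and the handful of fields shown are assumptions for the example, not the hardware VMCS encoding and not the actual KVM implementation (see [29] for the real code).

    /* Illustrative sketch only: a simplified software VMCS with a few
     * invented fields, not the architected VMCS and not KVM's layout. */
    struct vmcs_fields {
        unsigned int  exception_bitmap;    /* which exceptions cause a VMExit  */
        unsigned int  pin_based_controls;  /* e.g., external-interrupt exiting */
        unsigned int  cpu_based_controls;  /* e.g., CPUID/HLT exiting          */
        unsigned long guest_rip, guest_rsp, guest_cr3;  /* guest state (partial) */
        unsigned long host_rip,  host_rsp,  host_cr3;   /* host state (partial)  */
    };

    /* Build VMCS0->2: trap everything that either L0 or L1 wants trapped, run
     * the guest state that L1 prepared for L2, and make sure every exit lands
     * in L0. A real implementation must special-case many fields; OR-ing is
     * only correct for simple "trap on this event" bits. */
    static void merge_vmcs(const struct vmcs_fields *vmcs01,
                           const struct vmcs_fields *vmcs12,
                           struct vmcs_fields *vmcs02)
    {
        /* Control data: union of both hypervisors' trap requests, so that an
         * event L1 asked to trap still reaches L0 and can be forwarded. */
        vmcs02->exception_bitmap   = vmcs01->exception_bitmap   | vmcs12->exception_bitmap;
        vmcs02->pin_based_controls = vmcs01->pin_based_controls | vmcs12->pin_based_controls;
        vmcs02->cpu_based_controls = vmcs01->cpu_based_controls | vmcs12->cpu_based_controls;

        /* Guest state: mostly copied as-is from what L1 set up for L2. */
        vmcs02->guest_rip = vmcs12->guest_rip;
        vmcs02->guest_rsp = vmcs12->guest_rsp;
        vmcs02->guest_cr3 = vmcs12->guest_cr3;

        /* Host state: must describe L0, so that a VMExit from L2 enters L0.
         * The switch back to L1 is emulated separately, using the host state
         * that L1 wrote into VMCS1->2. */
        vmcs02->host_rip = vmcs01->host_rip;
        vmcs02->host_rsp = vmcs01->host_rsp;
        vmcs02->host_cr3 = vmcs01->host_cr3;
    }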
3.2.3 VMEntry and VMExit Emulation

In nested environments, switches from L1 to L2 and back must be emulated. When L2 is running and a VMExit occurs, there are two possible handling paths, depending on whether the VMExit must be handled only by L0 or must be forwarded to L1.

When the event causing the VMExit is related to L0 only, L0 handles the event and resumes L2. This kind of event can be an external interrupt, a non-maskable interrupt (NMI) or any trappable event specified in VMCS0→2 that was not specified in VMCS1→2. From L1's perspective this event does not exist because it was generated outside the scope of L1's virtualized environment. By analogy to the non-nested scenario, an event occurred at the hardware level, the CPU transparently handled it, and the hypervisor continued running as before.

The second handling path is caused by events related to L1 (e.g., trappable events specified in VMCS1→2). In this case L0 forwards the event to L1 by copying the VMCS0→2 fields updated by the processor to VMCS1→2 and resuming L1. The hypervisor running in L1 believes there was a VMExit directly from L2 to L1. The L1 hypervisor handles the event and later on resumes L2 by executing vmresume or vmlaunch, both of which will be emulated by L0.
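The decision logic reads roughly as follows. This is a hedged, self-contained sketch, simplified to exception exits and using invented names and types; it is not the KVM implementation.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the two software VMCS copies that L0 keeps. Only the few
     * fields needed for the example are shown; names are illustrative. */
    struct vmcs {
        unsigned int  exception_bitmap;    /* bit n set => exception n traps      */
        unsigned int  exit_reason;         /* written by the CPU on a VMExit      */
        unsigned long exit_qualification;  /* extra exit information from the CPU */
    };

    /* An exit belongs to L1 only if L1 asked to trap it in VMCS1->2; everything
     * else (external interrupts, NMIs, traps requested only in VMCS0->1) is
     * handled by L0, which then resumes L2 directly. */
    static bool exit_belongs_to_l1(const struct vmcs *vmcs12, unsigned int vector)
    {
        return (vmcs12->exception_bitmap >> vector) & 1u;
    }

    /* Reflect an exit into L1: copy the exit information the CPU deposited in
     * VMCS0->2 into VMCS1->2, so that L1 sees a VMExit apparently coming
     * straight from L2; L0 then resumes L1 instead of L2. */
    static void forward_exit_to_l1(const struct vmcs *vmcs02, struct vmcs *vmcs12)
    {
        vmcs12->exit_reason        = vmcs02->exit_reason;
        vmcs12->exit_qualification = vmcs02->exit_qualification;
    }

    int main(void)
    {
        struct vmcs vmcs12 = { .exception_bitmap = 1u << 14 };  /* L1 traps #PF  */
        struct vmcs vmcs02 = { .exit_reason = 0,                /* exception/NMI */
                               .exit_qualification = 0x1000 };  /* faulting addr */

        if (exit_belongs_to_l1(&vmcs12, 14)) {
            forward_exit_to_l1(&vmcs02, &vmcs12);
            puts("exit reflected to L1; L0 resumes L1");
        } else {
            puts("exit handled by L0; L0 resumes L2");
        }
        return 0;
    }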
3.3 MMU: Multi-dimensional Paging

In addition to virtualizing the CPU, a hypervisor also needs to virtualize the MMU: A guest OS builds a guest page table which translates guest virtual addresses to guest physical addresses. These must be translated again into host physical addresses. With nested virtualization, a third layer of address translation is needed.

These translations can be done entirely in software, or assisted by hardware. However, as we explain below, current hardware supports only one or two dimensions (levels) of translation, not the three needed for nested virtualization. In this section we present a new technique, multi-dimensional paging, for multiplexing the three needed translation tables onto the two available in hardware. In Section 4.1.2 we demonstrate the importance of this technique, showing that more naïve approaches (surveyed below) cause at least a three-fold slowdown of some useful workloads.

When no hardware support for memory management virtualization was available, a technique known as shadow page tables [15] was used. A guest creates a guest page table, which translates guest virtual addresses to guest physical addresses. Based on this table, the hypervisor creates a new page table, the shadow page table, which translates guest virtual addresses directly to the corresponding host physical addresses [3, 6]. The hypervisor then runs the guest using this shadow page table instead of the guest's page table. The hypervisor has to trap all guest paging changes, including page fault exceptions, the INVLPG instruction, context switches (which cause the use of a different page table) and all guest updates to the page table.

To improve virtualization performance, x86 architectures recently added two-dimensional page tables [13]—a second translation table in the hardware MMU. When translating a guest virtual address, the processor first uses the regular guest page table to translate it to a guest physical address. It then uses the second table, called EPT by Intel (and NPT by AMD), to translate the guest physical address to a host physical address. When an entry is missing in the EPT table, the processor generates an EPT violation exception. The hypervisor is responsible for maintaining the EPT table and its cache (which can be flushed with INVEPT), and for handling EPT violations, while guest page faults can be handled entirely by the guest.

The hypervisor, depending on the processor's capabilities, decides whether to use shadow page tables or two-dimensional page tables to virtualize the MMU. In nested environments, both hypervisors, L0 and L1, independently determine the preferred mechanism. Thus, the L0 and L1 hypervisors can use the same or a different MMU virtualization mechanism. Figure 4 shows three different nested MMU virtualization models.

Figure 4: MMU alternatives for nested virtualization

Shadow-on-shadow is used when the processor does not support two-dimensional page tables, and is the least efficient method. Initially, L0 creates a shadow page table to run L1 (SPT0→1). L1, in turn, creates a shadow page table to run L2 (SPT1→2). L0 cannot use SPT1→2 to run L2 because this table translates L2 guest virtual addresses to L1 host physical addresses. Therefore, L0 compresses SPT0→1 and SPT1→2 into a single shadow page table, SPT0→2. This new table translates directly from L2 guest virtual addresses to L0 host physical addresses. Specifically, for each guest virtual address in SPT1→2, L0 creates an entry in SPT0→2 with the corresponding L0 host physical address.

Shadow-on-EPT is the most straightforward approach to use when the processor supports EPT. L0 uses the EPT hardware, but L1 cannot use it, so it resorts to shadow page tables. L1 uses SPT1→2 to run L2. L0 configures the MMU to use SPT1→2 as the first translation table and EPT0→1 as the second translation table. In this way, the processor first translates from an L2 guest virtual address to an L1 host physical address using SPT1→2, and then translates from the L1 host physical address to the L0 host physical address using EPT0→1.

Though the shadow-on-EPT approach uses the EPT hardware, it still has a noticeable overhead due to page faults and page table modifications in L2. These must be handled in L1, to maintain the shadow page table. Each of these faults and writes causes VMExits and must be forwarded from L0 to L1 for handling. In other words, shadow-on-EPT is slow for exactly the same reasons that shadow paging itself was slow for single-level virtualization—but it is even slower because nested exits are slower than non-nested exits.

In multi-dimensional page tables, as in two-dimensional page tables, each level creates its own separate translation table. For L1 to create an EPT table, L0 exposes EPT capabilities to L1, even though the hardware only provides a single EPT table.

Since only one EPT table is available in hardware, the two EPT tables should be compressed into one: Let us assume that L0 runs L1 using EPT0→1, and that L1 creates an additional table, EPT1→2, to run L2, because L0 exposed a virtualized EPT capability to L1. The L0 hypervisor could then compress EPT0→1 and EPT1→2 into a single EPT0→2 table as shown in Figure 4. Then L0 could run L2 using EPT0→2, which translates directly from the L2 guest physical address to the L0 host physical address, reducing the number of page fault exits and improving nested virtualization performance. In Section 4.1.2 we demonstrate more than a three-fold speedup of some useful workloads with multi-dimensional page tables, compared to shadow-on-EPT.

The L0 hypervisor launches L2 with an empty EPT0→2 table, building the table on-the-fly, on L2 EPT-violation exits. These happen when a translation for a guest physical address is missing in the EPT table. If there is no translation in EPT1→2 for the faulting address, L0 first lets L1 handle the exit and update EPT1→2. L0 can now create an entry in EPT0→2 that translates the L2 guest physical address directly to the L0 host physical address: EPT1→2 is used to translate the L2 physical address to an L1 physical address, and EPT0→1 translates that into the desired L0 physical address.
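The core of this lazy construction can be sketched as follows. The page-granular array representation is a deliberate simplification (real EPTs are multi-level radix trees), and all names are illustrative rather than taken from KVM.

    #include <stdint.h>
    #include <stdbool.h>

    #define NPAGES     1024u        /* toy guest-physical space: 1024 frames   */
    #define NO_MAPPING UINT64_MAX   /* marks a missing entry in this toy model */

    /* Toy page-granular translation table: tab[gfn] holds the target frame
     * number, or NO_MAPPING when the entry is not present. */
    typedef uint64_t ept_table[NPAGES];

    /* On an L2 EPT violation for l2_gfn, try to install the missing EPT0->2
     * entry by composing EPT1->2 (L2 -> L1) with EPT0->1 (L1 -> L0). Returns
     * false if EPT1->2 has no entry; in that case L0 must first reflect the
     * violation into L1, let it update EPT1->2, and then retry. EPT0->1 is
     * assumed fully populated here, since L0 always backs L1's memory. */
    static bool handle_l2_ept_violation(uint64_t l2_gfn,
                                        const ept_table ept12,
                                        const ept_table ept01,
                                        ept_table ept02)
    {
        if (l2_gfn >= NPAGES || ept12[l2_gfn] == NO_MAPPING)
            return false;                   /* forward the exit to L1 first     */

        uint64_t l1_gfn = ept12[l2_gfn];    /* L2 guest-physical -> L1 physical */
        uint64_t l0_pfn = ept01[l1_gfn];    /* L1 physical -> L0 host-physical  */
        ept02[l2_gfn] = l0_pfn;             /* the compressed mapping L2 -> L0  */
        return true;
    }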
To maintain correctness of EPT0→2, the L0 hypervisor needs to know of any changes that L1 makes to EPT1→2. L0 sets the memory area of EPT1→2 as read-only, thereby causing a trap when L1 tries to update it. L0 will then update EPT0→2 according to the changed entries in EPT1→2. L0 also needs to trap all L1 INVEPT instructions, and invalidate the EPT cache accordingly.

By using huge pages [34] to back guest memory, L0 can create smaller and faster EPT tables. Finally, to further improve performance, L0 also allows L1 to use VPIDs. With this feature, the CPU tags each translation in the TLB with a numeric virtual-processor id, eliminating the need for TLB flushes on every VMEntry and VMExit. Since each hypervisor is free to choose these VPIDs arbitrarily, they might collide; therefore L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs.
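A trivially small sketch of such a VPID remapping table follows. It assumes that VPID 0 stays reserved for the hypervisor itself and ignores exhaustion and recycling of VPIDs; the names are invented for the example.

    #include <stdint.h>

    #define NUM_VPIDS 65536u

    /* l1_to_l0_vpid[v] is the hardware VPID that L0 programs whenever L1 asks
     * for VPID v; 0 means "not allocated yet". Purely illustrative. */
    static uint16_t l1_to_l0_vpid[NUM_VPIDS];
    static uint16_t next_free_vpid = 2;   /* 0 is reserved, 1 used for L1 itself */

    static uint16_t map_l1_vpid(uint16_t l1_vpid)
    {
        if (l1_to_l0_vpid[l1_vpid] == 0)
            l1_to_l0_vpid[l1_vpid] = next_free_vpid++;   /* no recycling here */
        return l1_to_l0_vpid[l1_vpid];
    }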
3.4 I/O: Multi-level Device Assignment

I/O is the third major challenge in server virtualization. There are three approaches commonly used to provide I/O services to a guest virtual machine. Either the hypervisor emulates a known device and the guest uses an unmodified driver to interact with it [47], or a para-virtual driver is installed in the guest [6, 42], or the host assigns a real device to the guest, which then controls the device directly [11, 31, 37, 52, 53]. Device assignment generally provides the best performance [33, 38, 53], since it minimizes the number of I/O-related world switches between the virtual machine and its hypervisor, and although it complicates live migration, device assignment and live migration can peacefully coexist [26, 28, 54].

These three basic I/O approaches for a single-level guest imply nine possible combinations in the two-level nested guest case. Of the nine potential combinations we evaluated the more interesting cases, presented in Table 1. Implementing the first four alternatives is straightforward. We describe the last option, which we call multi-level device assignment, below. Multi-level device assignment lets the L2 guest access a device directly, bypassing both hypervisors. This direct device access requires dealing with DMA, interrupts, MMIO, and PIOs [53].

    I/O virtualization method    I/O virtualization method
    between L0 & L1              between L1 & L2
    --------------------------   --------------------------
    Emulation                    Emulation
    Para-virtual                 Emulation
    Para-virtual                 Para-virtual
    Device assignment            Para-virtual
    Device assignment            Device assignment

Table 1: I/O combinations for a nested guest

Device DMA in virtualized environments is complicated, because guest drivers use guest physical addresses, while memory access in the device is done with host physical addresses. The common solution to the DMA problem is an IOMMU [2, 11], a hardware component which resides between the device and main memory. It uses a translation table prepared by the hypervisor to translate the guest physical addresses to host physical addresses. IOMMUs currently available, however, only support a single level of address translation. Again, we need to compress two levels of translation tables onto the one level available in hardware.

For modified guests this can be done using a paravirtual IOMMU: the code in L1 which sets a mapping on the IOMMU from L2 to L1 addresses is replaced by a hypercall to L0. L0 changes the L1 address in that mapping to the respective L0 address, and puts the resulting mapping (from L2 to L0 addresses) in the IOMMU.

A better approach, one which can run unmodified guests, is for L0 to emulate an IOMMU for L1 [5]. L1 believes that it is running on a machine with an IOMMU, and sets up mappings from L2 to L1 addresses on it. L0 intercepts these mappings, remaps the L1 addresses to L0 addresses, and builds the L2-to-L0 map on the real IOMMU.
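The interception step is, in essence, the same table compression as in multi-dimensional paging, applied to DMA addresses. The following toy sketch works at page-frame granularity with invented names; it does not resemble a real IOMMU driver.

    #include <stdint.h>
    #include <stdbool.h>

    #define NFRAMES    1024u
    #define NO_MAPPING UINT64_MAX

    /* Called when L0 intercepts L1 programming its (emulated) IOMMU with an
     * L2-bus-address -> L1-physical mapping. l1_backing[] is L0's record of
     * which L0 frame backs each L1 frame; hw_iommu[] stands for the single
     * level of translation the real IOMMU offers. */
    static bool shadow_iommu_mapping(uint64_t l2_iova_pfn, uint64_t l1_pfn,
                                     const uint64_t l1_backing[NFRAMES],
                                     uint64_t hw_iommu[NFRAMES])
    {
        if (l2_iova_pfn >= NFRAMES || l1_pfn >= NFRAMES ||
            l1_backing[l1_pfn] == NO_MAPPING)
            return false;                        /* L1 mapped memory it lacks */

        hw_iommu[l2_iova_pfn] = l1_backing[l1_pfn];  /* device DMA from L2 now
                                                        lands in L0 memory    */
        return true;
    }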
In the current x86 architecture, interrupts always cause a guest exit to L0, which proceeds to forward the interrupt to L1. L1 will then inject it into L2. The EOI (end of interrupt) will also cause a guest exit. In Section 4.1.1 we discuss the slowdown caused by these interrupt-related exits, and propose ways to avoid it.

Memory-mapped I/O (MMIO) and Port I/O (PIO) for a nested guest work the same way they work for a single-level guest, without incurring exits on the critical I/O path [53].

3.5 Micro Optimizations

There are two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor. First, the transitions between L1 and L2 are slower than the transitions between L0 and L1. Second, the exit handling code running in the L1 hypervisor is slower than the same code running in L0. In this section we discuss these two issues, and propose optimizations that improve performance. Since we assume that both L1 and L2 are unmodified, these optimizations require modifying L0 only. We evaluate these optimizations in the evaluation section.

3.5.1 Optimizing transitions between L1 and L2

As explained in Section 3.2.3, transitions between L1 and L2 involve an exit to L0 and then an entry. In L0, most of the time is spent merging the VMCS's. We optimize this merging code to only copy data between VMCS's if the relevant values were modified. Keeping track of which values were modified has an intrinsic cost, so one must carefully balance full copying versus partial copying and tracking. We observed empirically that for common workloads and hypervisors, partial copying has a lower overhead.
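One way to implement the partial copy is to keep a per-field dirty flag alongside L0's software copies of the VMCS fields; only flagged fields are copied (and, for VMCS0→2, written to the hardware VMCS) on the next transition. The sketch below is illustrative only, with an invented field list.

    #include <stdbool.h>

    /* Invented field list; a real VMCS has far more fields. */
    enum vmcs_field { GUEST_RIP, GUEST_RSP, GUEST_CR3, EXCEPTION_BITMAP, NR_FIELDS };

    struct shadow_vmcs {
        unsigned long field[NR_FIELDS];
        bool          dirty[NR_FIELDS];   /* set by whoever writes the field */
    };

    /* Copy only the fields that changed since the last L1<->L2 transition,
     * skipping the comparatively expensive copy/vmwrite for the rest. */
    static void sync_modified_fields(struct shadow_vmcs *src, struct shadow_vmcs *dst)
    {
        for (int i = 0; i < NR_FIELDS; i++) {
            if (!src->dirty[i])
                continue;
            dst->field[i] = src->field[i];
            src->dirty[i] = false;
        }
    }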
VMCS merging could be further optimized by copying multiple VMCS fields at once. However, according to Intel's specifications, reads or writes to the VMCS area must be performed using the vmread and vmwrite instructions, which operate on a single field. We empirically noted that under certain conditions one could access VMCS data directly without ill side-effects, bypassing vmread and vmwrite and copying multiple fields at once with large memory copies. However, this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested.

In the evaluation section, we show that this optimization gives a significant performance boost in micro-benchmarks. However, it did not noticeably improve the other, more typical, workloads that we have evaluated.

3.5.2 Optimizing exit handling in L1

The exit-handling code in the hypervisor is slower when run in L1 than the same code running in L0. The main cause of this slowdown is the additional exits caused by privileged instructions in the exit-handling code.

In Intel VMX, the privileged instructions vmread and vmwrite are used by the hypervisor to read and modify the guest and host specification. As can be seen in Section 4.3, these cause L1 to exit multiple times while it handles a single L2 exit.

In contrast, in AMD SVM, guest and host specifications can be read or written to directly using ordinary memory loads and stores. The clear advantage of that model is that L0 does not intervene while L1 modifies L2 specifications. Removing the need to trap and emulate special instructions reduces the number of exits and improves nested virtualization performance.

One thing L0 can do to avoid trapping on every vmread and vmwrite is binary translation [3] of problematic vmread and vmwrite instructions in the L1 instruction stream, by trapping the first time such an instruction is called and then rewriting it to branch to a non-trapping memory load or store. To evaluate the potential performance benefit of this approach, we tested a modified L1 that directly reads and writes VMCS1→2 in memory, instead of using vmread and vmwrite. The performance of this setup, which we call DRW (direct read and write), is described in the evaluation section.
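The contrast between the two access paths can be sketched as follows. The vmread below is the architected instruction (it must run in VMX root mode and, when executed by L1, traps to L0); the direct path only works because L0 happens to keep VMCS1→2 as an ordinary in-memory structure, whose layout here is an assumption for the example.

    #include <stdint.h>

    /* Invented in-memory layout for L0's software copy of VMCS1->2; the real
     * layout is an implementation detail of the L0 hypervisor. */
    struct vmcs12 {
        uint64_t guest_rip;
        /* ... many more fields ... */
    };

    /* Architected path: one vmread per field. Executed by L1, this traps to
     * L0, costing a full exit/entry round trip for every field accessed. */
    static inline uint64_t vmcs_read_trapping(unsigned long field_encoding)
    {
        uint64_t value;
        asm volatile("vmread %1, %0" : "=rm"(value) : "r"(field_encoding) : "cc");
        return value;
    }

    /* DRW path: a plain load from the in-memory structure, no exit at all.
     * Not guaranteed by the VMX specification; it relies on L0's emulation. */
    static inline uint64_t vmcs_read_direct(const struct vmcs12 *v)
    {
        return v->guest_rip;
    }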
4 Evaluation

We start the evaluation and analysis of nested virtualization with macro benchmarks that represent real-life workloads. Next, we evaluate the contribution of multi-level device assignment and multi-dimensional paging to nested virtualization performance. Most of our experiments are executed with KVM as the L1 guest hypervisor. In Section 4.2 we present results with VMware Server as the L1 guest hypervisor.

We then continue the evaluation with a synthetic, worst-case micro benchmark running on L2 which causes guest exits in a loop. We use this synthetic, worst-case benchmark to understand and analyze the overhead and the handling flow of a single L2 exit.

Our setup consisted of an IBM x3650 machine booted with a single Intel Xeon 2.9GHz core and with 3GB of memory. The host OS was Ubuntu 9.04 with a kernel that is based on the KVM git tree version kvm-87, with our nested virtualization support added. For both L1 and L2 guests we used an Ubuntu Jaunty guest with a kernel that is based on the KVM git tree, version kvm-87. L1 was configured with 2GB of memory and L2 was configured with 1GB of memory. For the I/O experiments we used a Broadcom NetXtreme 1Gb/s NIC connected via crossover cable to an e1000e NIC on another machine.

4.1 Macro Workloads

kernbench is a general purpose compilation-type benchmark that compiles the Linux kernel multiple times. The compilation process is, by nature, CPU- and memory-intensive, and it also generates disk I/O to load the compiled files into the guest's page cache.

SPECjbb is an industry-standard benchmark designed to measure the server-side performance of Java run-time environments. It emulates a three-tier system and is primarily CPU-intensive.

We executed kernbench and SPECjbb in four setups: host, single-level guest, nested guest, and nested guest optimized with direct read and write (DRW) as described in Section 3.5.2. The optimizations described in Section 3.5.1 did not make a significant difference to these benchmarks, and are thus omitted from the results. We used KVM as both the L0 and L1 hypervisor with multi-dimensional paging. The results are depicted in Table 2.

    Kernbench
                             Host     Guest    Nested   Nested DRW
    Run time                 324.3    355      406.3    391.5
    STD dev.                 1.5      10       6.7      3.1
    % overhead vs. host      -        9.5      25.3     20.7
    % overhead vs. guest     -        -        14.5     10.3
    % CPU                    93       97       99       99

    SPECjbb
                             Host     Guest    Nested   Nested DRW
    Score                    90493    83599    77065    78347
    STD dev.                 1104     1230     1716     566
    % degradation vs. host   -        7.6      14.8     13.4
    % degradation vs. guest  -        -        7.8      6.3
    % CPU                    100      100      100      100

Table 2: kernbench and SPECjbb results

We compared the impact of running the workloads in a nested guest with running the same workload in a single-level guest, i.e., the overhead added by the additional level of virtualization. For kernbench, the overhead of nested virtualization is 14.5%, while for SPECjbb the score is degraded by 7.82%. When we discount the Intel-specific vmread and vmwrite overhead in L1, the overhead is 10.3% and 6.3% respectively.
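These percentages follow directly from the raw numbers in Table 2: for kernbench, (406.3 − 355) / 355 ≈ 14.5% and (391.5 − 355) / 355 ≈ 10.3%; for SPECjbb, (83599 − 77065) / 83599 ≈ 7.8% and (83599 − 78347) / 83599 ≈ 6.3%.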
To analyze the sources of overhead, we examine the time distribution between the different levels. Figure 5 shows the time spent in each level. It is interesting to compare the time spent in the hypervisor in the single-level case with the time spent in L1 in the nested guest case, since both hypervisors are expected to do the same work. The times are indeed similar, although the L1 hypervisor takes more cycles due to cache pollution and … overall cycles for the nested-guest case. In other words, …

[Figures: distribution of time and of CPU cycles among Guest, L1, L0, and CPU mode switch; y-axis: CPU Cycles]

[Figure: netperf throughput (Mbps) and % CPU for native, a single-level guest (emulation, virtio, direct access), and a nested guest with various I/O combinations]

…the architecture given us a way to avoid these exits—by …

[Figure 8: Performance of netperf with interrupt-less network driver; throughput (Mbps) vs. message size (netperf -m) for L0 (bare metal), L2 (direct/direct), and L2 (direct/virtio)]

[Figure: Multi-dimensional paging]

When multi-dimensional paging is used, only an access to a guest physical page that is not mapped in the EPT table will cause an EPT violation exit. Therefore the impact of multi-dimensional paging depends on the number of guest page faults, which is a property of the workload. The improvement is startling in benchmarks such as kernbench with a high number of page faults, and is less pronounced in workloads that do not incur many page faults.

4.2 VMware Server as a Guest Hypervisor

We also evaluated VMware as the L1 hypervisor to analyze how a different guest hypervisor affects nested virtualization performance. We used the hosted version, VMware Server v2.0.1, build 156745 x86-64, on top of Ubuntu based on kernel 2.6.28-11. We intentionally did not install VMware tools for the L2 guest, thereby increasing nested virtualization overhead. Due to similar results obtained for VMware and KVM as the nested hypervisor, we show only kernbench and SPECjbb results below.

    Benchmark    % overhead vs. single-level guest
    kernbench    14.98
    SPECjbb      8.85

…overhead of handling guest exits in L0 and L1. Based on this definition, this cpuid micro benchmark is a worst-case workload, since L2 does virtually nothing except generate exits. We note that cpuid cannot in the general case be handled by L0 directly, as L1 may wish to modify the values returned to L2.

Figure 10 shows the number of CPU cycles required to execute a single cpuid instruction. We ran the cpuid instruction 4 × 10^6 times and calculated the average number of cycles per iteration. We repeated the test for the following setups: 1. native, 2. running cpuid in a single-level guest, and 3. running cpuid in a nested guest with and without the optimizations described in Section 3.5. For each execution, we present the distribution of the cycles between the levels: L0, L1, L2. CPU mode switch stands for the number of cycles spent by the CPU when performing a VMEntry or a VMExit. On bare metal cpuid takes about 100 cycles, while in a virtual machine it takes about 2,600 cycles (Figure 10, column 1), about 1,000 of which is due to the CPU mode switching. When run in a nested virtual machine it takes about 58,000 cycles (Figure 10, column 2).
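For reference, a measurement loop of this kind can be written as below. This is a sketch of the methodology described (a tight cpuid loop timed with the TSC, using the GCC/Clang __get_cpuid and __rdtsc wrappers), not necessarily the authors' exact benchmark, and it ignores TSC-versus-cycle subtleties.

    #include <stdint.h>
    #include <stdio.h>
    #include <cpuid.h>      /* __get_cpuid (GCC/Clang wrapper around cpuid) */
    #include <x86intrin.h>  /* __rdtsc */

    #define ITERATIONS 4000000ULL   /* 4 x 10^6 iterations, as in the text */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        uint64_t start = __rdtsc();

        for (uint64_t i = 0; i < ITERATIONS; i++)
            __get_cpuid(0, &eax, &ebx, &ecx, &edx);   /* leaf 0: vendor string */

        uint64_t cycles = __rdtsc() - start;
        printf("average cycles per cpuid: %.1f\n", (double)cycles / ITERATIONS);
        return 0;
    }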
[Figure 10: CPU cycles required to execute a single cpuid instruction in each setup, broken down by level (L0, L1, L2) and CPU mode switch; y-axis: CPU Cycles]

7. L1 executes a resume of L2
8. CPU traps and switches to root mode L0
9. L0 switches state from running L1 to running L2
10. CPU switches to guest mode L2

In general, step 5 can be repeated multiple times. Each iteration consists of a single VMExit from L1 to L0. The total number of exits depends on the specific implementation of the L1 hypervisor. A nesting-friendly hypervisor will keep privileged instructions to a minimum. In any case, the L1 hypervisor must interact with VMCS1→2, as described in Section 3.2.2. In the case of cpuid, in step 5, L1 reads 7 fields of VMCS1→2 and writes 4 fields to VMCS1→2, which ends up as 11 VMExits from L1 to L0. Overall, for a single L2 cpuid exit there are 13 CPU mode switches from guest mode to root mode and 13 CPU mode switches from root mode to guest mode, specifically in steps 2, 4, 5b, 5d, 8, and 10.

The number of cycles the CPU spends in a single switch to guest mode, plus the number of cycles to switch back to root mode, is approximately 1,000. The total CPU switching cost is therefore around 13,000 cycles.

The other two expensive steps are 3 and 9. As described in Section 3.5, these switches can be optimized. Indeed, as we show in Figure 10, column 3, using various optimizations we can reduce the virtualization overhead by 25%, and by 80% when using non-trapping vmread and vmwrite instructions.

By avoiding traps on vmread and vmwrite (Figure 10, columns 4 and 5), we removed the exits caused by VMCS1→2 accesses and the corresponding VMCS access emulation, step 5. This optimization reduced the switching cost by 84.6%, from 13,000 to 2,000.
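That is, 13 guest/root round trips at roughly 1,000 cycles each give 13 × 1,000 ≈ 13,000 cycles of pure mode switching, and going from 13,000 to 2,000 cycles is a reduction of (13,000 − 2,000) / 13,000 ≈ 84.6%.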
While it might still be possible to optimize steps 3 and 9 further, it is clear that the exits of L1 while handling a single exit of L2, and specifically VMCS accesses, are a major source of overhead. Architectural support for both faster world switches and VMCS updates without exits will reduce the overhead.

Examining Figure 10, it seems that handling cpuid in L1 is more expensive than handling cpuid in L0. Specifically, in column 3, the nested hypervisor L1 spends around 5,000 cycles to handle cpuid, while in column 1 the same hypervisor running on bare metal only spends 1,500 cycles to handle the same exit (note that these numbers do not include the mode switches). The code running in L1 and in L0 is identical; the difference in cycle count is due to cache pollution. Running the cpuid handling code incurs on average 5 L2 cache misses and 2 TLB misses when run in L0, whereas running the exact same code in L1 incurs on average 400 L2 cache misses and 19 TLB misses.

5 Discussion

In nested environments we introduce a new type of workload not found in single-level virtualization: the hypervisor as a guest. Traditionally, x86 hypervisors were designed and implemented assuming they will be running directly on bare metal. When they are executed on top of another hypervisor this assumption no longer holds, and the guest hypervisor's behavior becomes a key factor.

With a nested L1 hypervisor, the cost of a single L2 exit depends on the number of exits caused by L1 during the L2 exit handling. A nesting-friendly L1 hypervisor should minimize this critical chain to achieve better performance, for example by limiting the use of trap-causing instructions in the critical path.

Another alternative for reducing this critical chain is to para-virtualize the guest hypervisor, similar to OS para-virtualization [6, 50, 51]. While this approach could reduce L0 intervention when L1 virtualizes the L2 environment, the work being done by L0 to virtualize the L1 environment will still persist. How much this technique can help depends on the workload and on the specific approach used. Taking as a concrete example the conversion of vmreads and vmwrites to non-trapping loads/stores, para-virtualization could reduce the overhead for kernbench from 14.5% to 10.3%.

5.1 Architectural Overhead

Part of the overhead introduced with nested virtualization is due to the architectural design choices of the x86 hardware virtualization extensions.

Virtualization API: Two performance-sensitive areas in x86 virtualization are memory management and I/O virtualization. With multi-dimensional paging we compressed three MMU translation tables onto the two available in hardware; multi-level device assignment does the same for IOMMU translation tables. Architectural support for multiple levels of MMU and DMA translation tables—as many tables as there are levels of nested hypervisors—will immediately improve MMU and I/O virtualization.

Architectural support for delivering interrupts directly from the hardware to the L2 guest will remove L0 intervention on interrupt delivery and completion, intervention which, as we explained in Section 4.1.1, hurts nested performance. Such architectural support will also help single-level I/O virtualization performance [33].

VMX features such as MSR bitmaps, I/O bitmaps, and CR masks/shadows [48] proved to be effective in reducing exit overhead. Any architectural feature that reduces single-level exit overhead also shortens the nested critical path. Such features, however, also add implementation complexity, since to exploit them in nested environments they must be properly emulated by L0 hypervisors.

Removing the (Intel-specific) need to trap on every vmread and vmwrite instruction will give an immediate performance boost, as we showed in Section 3.5.2.

Same Core Constraint: The x86 trap-and-emulate implementation dictates that the guest and hypervisor share each core, since traps are always handled on the core where they occurred. Due to this constraint, when the hypervisor handles an exit the guest is temporarily stopped on that core. In a nested environment, the L1 guest hypervisor will also be interrupted, increasing the total interruption time of the L2 guest. Gavrilovska et al. presented techniques for exploiting additional cores to handle guest exits [19]. According to the authors, for a single level of virtualization, they measured 41% average improvements in call latency for null calls, cpuid and page table updates. These techniques could be adapted for nested environments in order to remove L0 interventions and also reduce privileged instruction call latencies, decreasing the total interruption time of a nested guest.

Cache Pollution: Each time the processor switches between the guest and the host context on a single core, the effectiveness of its caches is reduced. This phenomenon is magnified in nested environments, due to the increased number of switches. As was seen in Section 4.3, even after discounting L0 intervention, the L1 hypervisor still took more cycles to handle an L2 exit than it took to handle the same exit for the single-level scenario, due to cache pollution. Dedicating cores to guests could reduce cache pollution [7, 45, 46] and increase performance.

6 Conclusions and Future Work

Efficient nested x86 virtualization is feasible, despite the challenges stemming from the lack of architectural support for nested virtualization. Enabling efficient nested virtualization on the x86 platform through multi-dimensional paging and multi-level device assignment opens exciting avenues for exploration in such diverse areas as security, clouds, and architectural research.

We are continuing to investigate architectural and software-based methods to improve the performance of nested virtualization, while simultaneously exploring ways of building computer systems that have nested virtualization built in.

Last, but not least, while the Turtles project is fairly mature, we expect that the additional public exposure stemming from its open source release will help enhance its stability and functionality. We look forward to seeing in what interesting directions the research and open source communities will take it.

Acknowledgments

The authors would like to thank Alexander Graf and Joerg Roedel, whose KVM patches for nested SVM inspired parts of this work. The authors would also like to thank Ryan Harper, Nadav Amit, and our shepherd Robert English for insightful comments and discussions.
References

[1] Phoenix Hyperspace. http://www.hyperspace.com/.
[2] Abramson, D., Jackson, J., Muthrasanallur, S., Neiger, G., Regnier, G., Sankaran, R., Schoinas, I., Uhlig, R., Vembu, B., and Wiegert, J. Intel virtualization technology for directed I/O. Intel Technology Journal 10, 3 (August 2006), 179–192.
[3] Adams, K., and Agesen, O. A comparison of software and hardware techniques for x86 virtualization. SIGOPS Oper. Syst. Rev. 40, 5 (December 2006), 2–13.
[4] AMD. Secure Virtual Machine Architecture Reference Manual.
[5] Amit, N., Ben-Yehuda, M., and Yassour, B.-A. IOMMU: Strategies for mitigating the IOTLB bottleneck. In WIOSCA '10: Sixth Annual Workshop on the Interaction between Operating Systems and Computer Architecture.
[6] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. Xen and the art of virtualization. In SOSP '03: Symposium on Operating Systems Principles (2003).
[7] Baumann, A., Barham, P., Dagand, P.-E., Harris, T., Isaacs, R., Peter, S., Roscoe, T., Schüpbach, A., and Singhania, A. The multikernel: A new OS architecture for scalable multicore systems. In SOSP '09: 22nd ACM SIGOPS Symposium on Operating Systems Principles, pp. 29–44.
[8] Bellard, F. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference (2005), p. 41.
[9] Belpaire, G., and Hsu, N.-T. Hardware architecture for recursive virtual machines. In ACM '75: 1975 Annual ACM Conference, pp. 14–18.
[10] Belpaire, G., and Hsu, N.-T. Formal properties of recursive virtual machine architectures. SIGOPS Oper. Syst. Rev. 9, 5 (1975), 89–96.
[11] Ben-Yehuda, M., Mason, J., Xenidis, J., Krieger, O., Van Doorn, L., Nakajima, J., Mallick, A., and Wahlig, E. Utilizing IOMMUs for virtualization in Linux and Xen. In OLS '06: The 2006 Ottawa Linux Symposium, pp. 71–86.
[12] Berghmans, O. Nesting virtual machines in virtualization test frameworks. Master's thesis, University of Antwerp, May 2010.
[13] Bhargava, R., Serebrin, B., Spadini, F., and Manne, S. Accelerating two-dimensional page walks for virtualized systems. In ASPLOS '08: 13th International Conference on Architectural Support for Programming Languages and Operating Systems (2008).
[14] Clark, C., Fraser, K., Hand, S., Hansen, J. G., Jul, E., Limpach, C., Pratt, I., and Warfield, A. Live migration of virtual machines. In NSDI '05: Second Symposium on Networked Systems Design & Implementation (2005), pp. 273–286.
[15] Devine, S. W., Bugnion, E., and Rosenblum, M. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent #6397242, May 2002.
[16] Ford, B., Hibler, M., Lepreau, J., Tullmann, P., Back, G., and Clawson, S. Microkernels meet recursive virtual machines. In OSDI '96: Second USENIX Symposium on Operating Systems Design and Implementation (1996), pp. 137–151.
[17] Garfinkel, T., Adams, K., Warfield, A., and Franklin, J. Compatibility is not transparency: VMM detection myths and realities. In HOTOS '07: 11th USENIX Workshop on Hot Topics in Operating Systems (2007), pp. 1–6.
[18] Garfinkel, T., and Rosenblum, M. A virtual machine introspection based architecture for intrusion detection. In Network & Distributed Systems Security Symposium (2003), pp. 191–206.
[19] Gavrilovska, A., Kumar, S., Raj, H., Schwan, K., Gupta, V., Nathuji, R., Niranjan, R., Ranadive, A., and Saraiya, P. High-performance hypervisor architectures: Virtualization in HPC systems. In HPCVIRT '07: 1st Workshop on System-level Virtualization for High Performance Computing.
[20] Gebhardt, C., and Dalton, C. LaLa: A late launch application. In STC '09: 2009 ACM Workshop on Scalable Trusted Computing (2009), pp. 1–8.
[21] Goldberg, R. P. Architecture of virtual machines. In Proceedings of the Workshop on Virtual Computer Systems (New York, NY, USA, 1973), ACM, pp. 74–112.
[22] Goldberg, R. P. Survey of virtual machine research. IEEE Computer Magazine (June 1974), 34–45.
[23] Graf, A., and Roedel, J. Nesting the virtualized world. Linux Plumbers Conference, September 2009.
[24] He, Q. Nested virtualization on Xen. Xen Summit Asia 2009.
[25] Huang, J.-C., Monchiero, M., and Turner, Y. Ally: OS-transparent packet inspection using sequestered cores. In WIOV '10: The Second Workshop on I/O Virtualization.
[26] Huang, W., Liu, J., Koop, M., Abali, B., and Panda, D. Nomad: Migrating OS-bypass networks in virtual machines. In VEE '07: 3rd International Conference on Virtual Execution Environments (2007), pp. 158–168.
[27] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual. 2009.
[28] Kadav, A., and Swift, M. M. Live migration of direct-access devices. In WIOV '08: First Workshop on I/O Virtualization.
[29] Kivity, A., Kamay, Y., Laor, D., Lublin, U., and Liguori, A. KVM: The Linux virtual machine monitor. In Ottawa Linux Symposium (July 2007), pp. 225–230.
[30] Lauer, H. C., and Wyeth, D. A recursive virtual machine architecture. In Workshop on Virtual Computer Systems (1973), pp. 113–116.
[31] LeVasseur, J., Uhlig, V., Stoess, J., and Götz, S. Unmodified device driver reuse and improved system dependability via virtual machines. In OSDI '04: 6th Symposium on Operating Systems Design & Implementation (2004), p. 2.
[32] LeVasseur, J., Uhlig, V., Yang, Y., Chapman, M., Chubb, P., Leslie, B., and Heiser, G. Pre-virtualization: Soft layering for virtual machines. In ACSAC '08: 13th Asia-Pacific Computer Systems Architecture Conference, pp. 1–9.
[33] Liu, J. Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support. In IPDPS '10: IEEE International Parallel and Distributed Processing Symposium (2010).
[34] Navarro, J., Iyer, S., Druschel, P., and Cox, A. Practical, transparent operating system support for superpages. In OSDI '02: 5th Symposium on Operating Systems Design and Implementation (2002), pp. 89–104.
[35] Osisek, D. L., Jackson, K. M., and Gum, P. H. ESA/390 interpretive-execution architecture, foundation for VM/ESA. IBM Systems Journal 30, 1 (1991).
[36] Popek, G. J., and Goldberg, R. P. Formal requirements for virtualizable third generation architectures. Communications of the ACM 17, 7 (July 1974), 412–421.
[37] Raj, H., and Schwan, K. High performance and scalable I/O virtualization via self-virtualized devices. In HPDC '07: 16th International Symposium on High Performance Distributed Computing (2007), pp. 179–188.
[38] Ram, K. K., Santos, J. R., Turner, Y., Cox, A. L., and Rixner, S. Achieving 10Gbps using safe and transparent network interface virtualization. In VEE '09: The 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (March 2009).
[39] Riley, R., Jiang, X., and Xu, D. Guest-transparent prevention of kernel rootkits with VMM-based memory shadowing. In Recent Advances in Intrusion Detection, vol. 5230 of Lecture Notes in Computer Science (2008), ch. 1, pp. 1–20.
[40] Robin, J. S., and Irvine, C. E. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In 9th USENIX Security Symposium (2000), p. 10.
[41] Rosenblum, M. VMware's virtual platform: A virtual machine monitor for commodity PCs. In Hot Chips 11 (1999).
[42] Russell, R. virtio: Towards a de-facto standard for virtual I/O devices. SIGOPS Oper. Syst. Rev. 42, 5 (2008), 95–103.
[43] Rutkowska, J. Subverting Vista kernel for fun and profit. Black Hat, August 2006.
[44] Seshadri, A., Luk, M., Qu, N., and Perrig, A. SecVisor: A tiny hypervisor to provide lifetime kernel code integrity for commodity OSes. In SOSP '07: 21st ACM SIGOPS Symposium on Operating Systems Principles (2007), pp. 335–350.
[45] Shalev, L., Borovik, E., Satran, J., and Ben-Yehuda, M. IsoStack—highly efficient network processing on dedicated cores. In USENIX ATC '10: The 2010 USENIX Annual Technical Conference (2010).
[46] Shalev, L., Makhervaks, V., Machulsky, Z., Biran, G., Satran, J., Ben-Yehuda, M., and Shimony, I. Loosely coupled TCP acceleration architecture. In HOTI '06: 14th IEEE Symposium on High-Performance Interconnects (Washington, DC, USA, 2006), IEEE Computer Society, pp. 3–8.
[47] Sugerman, J., Venkitachalam, G., and Lim, B.-H. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In USENIX Annual Technical Conference (2001).
[48] Uhlig, R., Neiger, G., Rodgers, D., Santoni, A. L., Martins, F. C. M., Anderson, A. V., Bennett, S. M., Kagi, A., Leung, F. H., and Smith, L. Intel virtualization technology. Computer 38, 5 (2005), 48–56.
[49] Waldspurger, C. A. Memory resource management in VMware ESX Server. In OSDI '02: 5th Symposium on Operating Systems Design and Implementation.
[50] Whitaker, A., Shaw, M., and Gribble, S. D. Denali: A scalable isolation kernel. In EW 10: 10th ACM SIGOPS European Workshop (2002), pp. 10–15.
[51] Whitaker, A., Shaw, M., and Gribble, S. D. Scale and performance in the Denali isolation kernel. SIGOPS Oper. Syst. Rev. 36, SI (2002), 195–209.
[52] Willmann, P., Shafer, J., Carr, D., Menon, A., Rixner, S., Cox, A. L., and Zwaenepoel, W. Concurrent direct network access for virtual machine monitors. In HPCA '07: 13th IEEE International Symposium on High Performance Computer Architecture (2007), pp. 306–317.
[53] Yassour, B.-A., Ben-Yehuda, M., and Wasserman, O. Direct device assignment for untrusted fully-virtualized virtual machines. Tech. Rep. H-0263, IBM Research, 2008.
[54] Zhai, E., Cummings, G. D., and Dong, Y. Live migration with pass-through device for Linux VM. In OLS '08: The 2008 Ottawa Linux Symposium (July 2008), pp. 261–268.