Optimizing TLB Shootdown Efficiency
Nadav Amit
VMware Research
Approximation. Let us first rule out the naive approach to approximate direct TLB insertion by: (1) setting a PTE e; (2) accessing the page, thus prompting the hardware to load the corresponding mapping m into the TLB and to set e's access bit; and then (3) having the OS clear e's access bit. This approach is buggy due to the time window between the second and third steps, which allows other cores to cache m in their TLBs before the bit is cleared, resulting in a false indication that the page is private. Shootdown will then be erroneously skipped.

Figure 1: Direct TLB insertion using a secondary hierarchy. The primary and secondary page-table hierarchies (PGD, PUD, PMD, and PTE levels) share the same PCID; the steps are: (1) change CR3 to the secondary hierarchy; (2) wire both hierarchies [primary access bit is 0]; (3) read the page [translation loaded into the TLB from the secondary CR3; primary bit remains 0]; (4) change CR3 back to the primary hierarchy; (5) zero the secondary PTE.

We resolve this problem and avoid the above race by using Intel's address space IDs, known as process-context identifiers (PCIDs) [25]. PCIDs enable TLBs to hold mappings of multiple address spaces by associating every cached PTE with the PCID of its address space. The PCID of the current address space is stored in the same register as the pointer to the root of the page-table hierarchy (CR3), and TLB entries are associated with this PCID when they are cached. The CPU uses for address translation only PTEs whose PCID matches the current one. This feature is intended to allow OSes to avoid global TLB invalidations during context switches and to reduce the number of TLB misses.

PCID is not currently used by Linux due to the limited number of supported address spaces and the questionable performance gains from TLB miss reduction. We exploit this feature in a different manner. Nevertheless, our use does not prevent or limit future PCID support in the OS.
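To make the mechanism concrete, below is a minimal kernel-style sketch of loading CR3 together with a PCID. write_cr3_pcid is a hypothetical helper rather than Linux's actual CR3 interface, and it assumes ring 0 with CR4.PCIDE enabled:

```c
#include <linux/types.h>

#define CR3_PCID_MASK     0xFFFull        /* bits 11:0 hold the PCID */
#define CR3_PCID_NOFLUSH  (1ull << 63)    /* keep this PCID's TLB entries */

/* Hypothetical helper: load a page-table root together with a 12-bit
 * PCID. Requires CR4.PCIDE=1 and privilege level 0. */
static inline void write_cr3_pcid(u64 pgd_phys, u16 pcid)
{
	u64 cr3 = (pgd_phys & ~CR3_PCID_MASK)
	        | (pcid & CR3_PCID_MASK)
	        | CR3_PCID_NOFLUSH;
	asm volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}
```

Setting bit 63 (the "no-flush" bit) is what makes such switches cheap: the CPU retains TLB entries tagged with the incoming PCID instead of flushing them on the CR3 load.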
The technique ABIS employs to provide direct TLB insertion is depicted in Figure 1. Upon initialization, ABIS preallocates for each core a "secondary" page-table hierarchy, which consists of four pages, one for each level of the hierarchy. The uppermost level of the page table (PGD) is then set to point to the kernel mappings (like all other address spaces). The other three pages are not connected to the hierarchy at this stage, but are wired dynamically later, according to the address of the PTE that is inserted into the TLB.

While executing, the currently running thread T occasionally experiences page faults, notably due to demand paging. When a page fault fires, the OS handler is invoked and locks the PT that holds the faulting PTE—no other core will simultaneously handle the same fault.

At this point, ABIS loads the secondary space into CR3 along with a PCID equal to that of T (Step 1 in Figure 1). Afterwards, ABIS wires the virtual-to-physical mapping of the target page in both the primary and secondary spaces, leaving the corresponding access bit in the primary hierarchy clear (Step 2).

Then, ABIS reads from the page. Because the associated mapping is currently missing from the TLB (a page fault fired), and because CR3 currently points to the secondary space, reading the page prompts the hardware to walk the secondary hierarchy and to insert the appropriate translation into the TLB, leaving the primary bit clear (Step 3). Importantly, the inserted translation is valid and usable within the primary space, because both spaces have the same PCID and point to the same physical page using the same virtual address. This approach eliminates the aforementioned race: no other core is able to access the secondary space, as it is private to the core.

After reading the page, ABIS loads the primary hierarchy back into CR3, allowing the thread to continue as usual (Step 4). It then clears the PTE from the secondary space (Step 5), thereby preventing further use of translation data from the secondary hierarchy that may have been cached in the hardware page-walk cache (PWC). If the secondary tables are nevertheless used by the CPU for translation, no valid PTE will be found and the CPU will restart the page walk from the root entry.

Finally, using our "software-PTE" (SPTE) data structure (§4.3), ABIS associates the faulting PTE e with the current core c that has just resolved e. When the time comes to flush e, if ABIS determines that e is still private to c, it will invalidate e on c only, thus avoiding the shootdown overhead.
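Putting the five steps together, the following condensed sketch shows the fault-path flow, reusing write_cr3_pcid from the sketch above. All other helpers (this_cpu_secondary_pt, wire_secondary_mapping, set_primary_pte, clear_secondary_pte, spte_set_owner, mm_pcid, mm_pgd_phys) are hypothetical stand-ins for ABIS internals, not actual kernel APIs; pte_mkold() and READ_ONCE() are standard Linux primitives:

```c
/* Condensed sketch of the fault-path flow (Steps 1-5 in Figure 1).
 * Assumes the PT lock for the faulting PTE is already held, as the
 * text describes. */
static void abis_direct_tlb_insert(struct mm_struct *mm,
				   unsigned long vaddr, pte_t pte, int cpu)
{
	struct secondary_pt *sec = this_cpu_secondary_pt(); /* per-core tables */

	/* Step 1: point CR3 at the secondary hierarchy, same PCID as T */
	write_cr3_pcid(sec->pgd_phys, mm_pcid(mm));

	/* Step 2: wire the mapping in both hierarchies; the primary PTE is
	 * installed with its access bit clear (pte_mkold) */
	wire_secondary_mapping(sec, vaddr, pte);
	set_primary_pte(mm, vaddr, pte_mkold(pte));

	/* Step 3: touch the page; the hardware walks the secondary tables
	 * and inserts the translation while the primary bit stays 0 */
	(void)READ_ONCE(*(char *)vaddr);

	/* Step 4: switch CR3 back to the primary hierarchy */
	write_cr3_pcid(mm_pgd_phys(mm), mm_pcid(mm));

	/* Step 5: zero the secondary PTE so stale page-walk-cache entries
	 * can never yield a translation through the secondary tables */
	clear_secondary_pte(sec, vaddr);

	/* record the resolving core so a future flush of a still-private
	 * PTE can be performed locally instead of via shootdown */
	spte_set_owner(mm, vaddr, cpu);
}
```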
Coexisting with Linux. Linux reads and clears architectural access bits (hwA-s) via a small API, allowing us to easily mask these bits while making sure
Figure 4: Software PTE (SPTE) and its association to the page table through the meta-data of the page-table frame. (Flush decision: if SPTE.CPU = all, flush all CPUs; otherwise flush SPTE.CPU only.)

table, one holding the CPU architectural PTEs and the second holding the corresponding SPTEs, each at a fixed offset from its PTE. While this scheme is simple, it wastes memory, as it requires the SPTE to be the same size as a PTE (8B), when in fact an SPTE only occupies two bytes.

We therefore allocate an SPT separately during the PT construction, and set a pointer to the SPT in the PT page-frame meta-data (page struct). Linux can quickly retrieve this meta-data, allowing us to access the SPTE of a certain PTE with small overhead. The SPTE pointer does not increase the page-frame meta-data, as it is set in an unused PT meta-data field (second quadword). The SPT therefore increases page-table memory consumption by 25%. ABIS prevents races during SPT changes by protecting the SPT with the same lock that is used to protect PT changes. It is noteworthy that although SPT management introduces an overhead, it is negligible relative to the other overheads in the workloads we evaluated.
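As an illustration, the SPTE lookup can be as cheap as the following kernel-style sketch, which assumes a 2-byte SPTE per 8-byte PTE and an SPT pointer stashed in the page-table frame's struct page; the field reuse and names here are illustrative, not ABIS's actual definitions:

```c
#include <linux/mm.h>

/* Illustrative 2-byte SPTE: the core that owns the mapping, or
 * SPTE_CPU_ALL when the page may be cached in several TLBs. */
struct spte {
	u16 cpu;
};
#define SPTE_CPU_ALL	0xFFFF

static struct spte *spte_of(pte_t *pte)
{
	/* struct page of the page-table frame containing this PTE */
	struct page *pt_page = virt_to_page(pte);

	/* SPT pointer kept in otherwise-unused page-table meta-data
	 * (the paper uses the second quadword; 'private' stands in here) */
	struct spte *spt = (struct spte *)page_private(pt_page);

	/* the SPTE sits at a fixed offset derived from the PTE's index */
	unsigned long idx = ((unsigned long)pte & ~PAGE_MASK) / sizeof(pte_t);
	return &spt[idx];
}
```

Two bytes of SPTE per eight-byte PTE is precisely the 25% page-table memory overhead quoted above.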
processes read a big sparse memory mapped file. As
a result, memory pressure builds up, and memory
5. Evaluation
is reclaimed. While almost all the shootdowns are
We implemented a fully-functional prototype of eliminated, the runtime is not affected, as apparently
the system, ABIS, which is based on Linux 4.5. there are more significant overheads, specifically
As a baseline system for comparison we use the those of the page frame reclamation algorithm.
same version of Linux, which includes recent TLB
shootdown optimizations. We run each test 5 times Multithreaded msync (msync-mt). Multiple threads
and report the average result. Our testbed consists access a memory mapped file and call the msync
of a two-socket Dell PowerEdge R630 with Intel 24- system-call to flush the memory changes to the file.
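The access pattern behind these shootdowns can be reproduced with a minimal userspace sketch (not the vm-scalability source; the input file and thread count are made up). Each thread first reads a page of a MAP_PRIVATE mapping, caching a read-only translation, then writes to it, forcing the kernel to copy the page, rewrite the PTE, and invalidate the stale translation on all cores running the process:

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS 8
#define PAGE 4096

static volatile char *buf;
static size_t chunk;

static void *writer(void *arg)
{
	size_t base = (size_t)arg * chunk;
	for (size_t off = 0; off < chunk; off += PAGE) {
		char c = buf[base + off];   /* read: maps the page read-only */
		buf[base + off] = c + 1;    /* write: COW copy + TLB shootdown */
	}
	return NULL;
}

int main(void)
{
	int fd = open("datafile", O_RDONLY);          /* hypothetical input */
	size_t len = (size_t)lseek(fd, 0, SEEK_END);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	chunk = len / NTHREADS;

	pthread_t t[NTHREADS];
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, writer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}
```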
Memory mapped reads (mmap-read). Multiple processes read a big sparse memory-mapped file. As a result, memory pressure builds up and memory is reclaimed. While almost all of the shootdowns are eliminated, the runtime is not affected, as apparently there are more significant overheads, specifically those of the page-frame reclamation algorithm.

Multithreaded msync (msync-mt). Multiple threads access a memory-mapped file and call the msync system call to flush the memory changes to the file. msync can cause an overwhelming number of flushes, as the OS clears the dirty-bit. ABIS eliminates 98% of the shootdowns but does not reduce the runtime, as file-system overhead appears to be the main performance bottleneck.
Figure 6: Execution of an Apache web server which serves the Wrk workload generator. (a) Throughput and speedup; (b) TLB shootdowns sent and received per second, baseline vs. ABIS; both as a function of the number of cores.
Anonymous memory read (anon-r-seq). To evaluate ABIS overheads we run a benchmark that performs sequential anonymous memory reads and does not cause TLB shootdowns. This benchmark's runtime is 9% longer under ABIS. Profiling the benchmark shows that the software TLB manipulations consume 9% of the runtime, suggesting that hardware enhancements for manipulating the TLB could eliminate most of the overheads.

5.2 Apache Web Server

Apache is the most widely used web server software. In our tests, we use Apache v2.4.18 and enable buffered server logging for more efficient disk accesses. We use the multithreaded Wrk workload generator to create web requests [50], and set it to repeatedly request the default Apache web page for 30 seconds, using 400 connections and 6 threads. We use the same server for both the generator and Apache, and isolate each one on a set of cores. We ensure that the generator is unaffected by ABIS.

Apache provides several multi-processing modules. We use the default "mpm_event" module, which spawns multiple processes, each of which runs multiple threads. Apache serves each request by creating a memory mapping of the requested file, sending its content and unmapping it. This behavior effectively causes frequent invalidations of short-lived mappings. In the baseline system, the invalidation also requires expensive TLB shootdowns to the cores that run other threads of the Apache process. Effectively, when Apache serves concurrent requests using multiple threads, it triggers a TLB shootdown for each request that it serves.
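The per-request pattern the text describes boils down to the following sketch (not Apache's actual code; serve_file and its parameters are made up for illustration):

```c
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

static void serve_file(int client_fd, int file_fd, size_t len)
{
	/* map the requested file ... */
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, file_fd, 0);
	if (p == MAP_FAILED)
		return;

	/* ... send its content (reading p caches the translation) ... */
	send(client_fd, p, len, 0);

	/* ... and unmap it: the short-lived mapping dies here, and the
	 * baseline kernel broadcasts an invalidation to the cores that
	 * run the process's other threads */
	munmap(p, len);
}
```

Because the mapping lives for just one request, the baseline kernel pays a broadcast invalidation on every munmap; ABIS can instead detect that the translation was only ever cached by the serving core.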
Figure 6a depicts the number of requests per second that are served when the server runs on different numbers of cores. ABIS improves performance by 12% when all cores are used. Executing the benchmark reveals that the effect of ABIS on performance is inconsistent when the number of cores is low, as ABIS causes slowdowns of up to 8% and speedups of up to 42%. Figure 6b presents the number of TLB shootdowns that are sent and received in the baseline system and in ABIS. As shown, in the baseline system, as more cores are used, the number of sent TLB shootdowns becomes almost identical to the number of requests that Apache serves. ABIS reduces the number of both sent and received shootdowns by up to 90%, as it identifies that PTEs are private and that local invalidation suffices.

5.3 PBZIP2

Parallel bzip2 (PBZIP2) is a multithreaded implementation of the bzip2 file compressor [20]. In this benchmark we evaluate the effect of reclamation due to memory pressure on PBZIP2, which in itself does not cause many TLB flushes. We use PBZIP2 to compress the Linux 4.4 tar file. We configure the benchmark to read the input file into RAM and split it between processors using a 500k block size. We run PBZIP2 in a container and limit its memory to 300MB to induce swap activity. This activity causes the invalidation of long-lived idle mappings as inactive memory is reclaimed.

The compression time is shown in Figure 7a. ABIS outperforms Linux by up to 12%, and the speedup grows with the number of cores. Figure 7b presents the number of TLB shootdowns per second when this benchmark runs. The baseline Linux system sends nearly 200k shootdowns regardless of the number of threads; the different shootdown send rates are merely due to the shorter runtime when the number of cores is higher. The number of received shootdowns in the baseline system is
Figure 7: Execution of PBZIP2 when compressing the Linux kernel. The process memory is limited to 300MB to exercise page reclamation. (a) Runtime and speedup; (b) Remote shootdowns; both as a function of the number of threads.
Figure 8: Normalized runtime and number of TLB shootdowns in ABIS when running the PARSEC benchmarks (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips). The numbers above the bars indicate the baseline (left) runtime in seconds and (right) rate of TLB shootdowns in thousands/second.

core and performs TLB flushes accordingly [48]. Teller et al. proposed that OSes save a version count for each PTE, to be used by hardware to perform TLB invalidations only when memory is addressed via a stale TLB entry [39]. Li et al. eliminate unwarranted shootdowns of PTEs that are only used by a single core by extending PTEs to accommodate the core that first accessed a page, enhancing the CPU to track whether a PTE is private, and avoiding shootdowns accordingly [27].

These studies present compelling evaluation results; however, they require intrusive micro-architecture changes, which CPU vendors are apparently reluctant to introduce, presumably due to a history of TLB bugs [1, 16, 17, 35, 46].

Software Solutions. To avoid unnecessary recurring TLB flushes of invalidated PTEs, Uhlig tracks TLB versions and avoids shootdowns when the remote cores have already performed full TLB flushes after the PTE changed [43, 44]. However, the potential of this approach is limited, since even when TLB invalidations are batched, the TLB is flushed shortly after the last PTE is modified.

An alternative approach for reducing TLB flushes is to require applications to inform the OS how memory is used, or to control TLB flushes explicitly. Corey OS avoids TLB shootdowns of private PTEs by requiring that user applications define which memory ranges are private and which are shared [10]. C4 uses an enhanced Linux version that allows applications to control TLB invalidations [40]. These systems, however, place an additional burden on application writers. Finally, we should note that reducing the number of memory-mapping changes, for example by improving the memory reclamation policy, can

7. Hardware Support

Although our system saves most of the TLB shootdowns, it does introduce some overheads. Hardware support that would allow privileged OSes to insert PTEs directly into the TLB without setting the access-bit would eliminate most of ABIS's overhead. Such an enhancement should be easy to implement, as we achieve equivalent behavior in software.

ABIS would be able to save even more TLB flushes if CPUs avoided setting the PTE access-bit after the PTE is cached in the TLBs. We encountered, however, situations where such events occur. It appears that when Intel CPUs set the PTE dirty-bit due to a write access, they also set the access-bit, even if the PTE is already cached in the TLB. Similarly, before a CPU triggers a page fault, it performs a page walk to retrieve the updated PTE from memory and may set the access-bit even if the PTE disallows access. Since x86 CPUs invalidate the PTE immediately afterwards, before invoking the page-fault exception handler, setting the access-bit is unnecessary.

CPUs should not invalidate the TLB unnecessarily, as such invalidations hurt performance regardless of ABIS. ABIS is further affected, as these invalidations cause the access-bit to be set again when the CPU re-caches the PTE. We found that Intel CPUs (unlike AMD CPUs) may perform full TLB flushes when virtual machines invalidate huge pages that are backed by small host pages.

8. Conclusion

We have presented two new software techniques that prevent TLB shootdowns in common cases, without replicating the mapping structures and without incurring more page faults. We have shown their benefits in a variety of workloads. While our system introduces overheads in certain cases, these can be reduced by minor CPU enhancements. Our study suggests that providing OSes better control over TLBs may be an efficient and simple way to reduce TLB coherency overheads.

Availability

The source code is publicly available at:
http://nadav.amit.to/publications/tlb

Acknowledgment

This work could not have been done without the continued support of Dan Tsafrir and Assaf Schuster. I also thank the paper reviewers and the shepherd, Jean-Pierre Lozi.
[9] Daniel Bovet and Marco Cesati. Understanding the Linux Kernel, Third Edition. O'Reilly & Associates, Inc., 2005.

[10] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, M. Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yue-hua Dai, et al. Corey: An operating system for many cores. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), pages 43–57, 2008.

[11] Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In ACM SIGOPS European Conference on Computer Systems (EuroSys), pages 211–224, 2013.

[12] Joel Coburn, Adrian M. Caulfield, Ameen Akel,

[23] Dave Hansen. Patch: x86: set TLB flush tunable to sane value. https://patchwork.kernel.org/patch/4460841/, 2014.

[24] Xeon Phi processor. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html.

[25] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual. Reference number: 325462-057US, 2015. https://software.intel.com/en-us/articles/intel-sdm.

[26] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), pages 705–721, 2016.