Optimizing the TLB Shootdown Algorithm with Page Access Tracking

Nadav Amit, VMware Research
https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit

This paper is included in the Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17).
July 12–14, 2017 • Santa Clara, CA, USA
ISBN 978-1-931971-38-6

Open access to the Proceedings of the 2017 USENIX Annual Technical Conference is sponsored by USENIX.
Optimizing the TLB Shootdown Algorithm with Page Access Tracking

Nadav Amit
VMware Research

Abstract

The operating system is tasked with maintaining the coherency of per-core TLBs, necessitating costly synchronization operations, notably to invalidate stale mappings. As core-counts increase, the overhead of TLB synchronization likewise increases and hinders scalability, whereas existing software optimizations that attempt to alleviate the problem (like batching) are lacking.

We address this problem by revising the TLB synchronization subsystem. We introduce several techniques that detect cases whereby soon-to-be invalidated mappings are cached by only one TLB or not cached at all, allowing us to entirely avoid the cost of synchronization. In contrast to existing optimizations, our approach leverages hardware page access tracking. We implement our techniques in Linux and find that they reduce the number of TLB invalidations by up to 98% on average and thus improve performance by up to 78%. Evaluations show that while our techniques may introduce overheads of up to 9% when memory mappings are never removed, these overheads can be avoided by simple hardware enhancements.

1. Introduction

Translation lookaside buffers (TLBs) are perhaps the most frequently accessed caches whose coherency is not maintained by modern CPUs. The TLB is tasked with caching virtual-to-physical translations ("mappings") of memory addresses, and so it is accessed upon every memory read or write operation. Maintaining TLB coherency in hardware hampers performance [33], so CPU vendors require OSes to maintain coherency in software. But it is difficult for OSes to efficiently achieve this goal [27, 38, 39, 41, 48].

To maintain TLB coherency, OSes employ the TLB shootdown protocol [8]. If a mapping m that possibly resides in the TLB becomes stale (due to memory mapping changes), the OS flushes m from the local TLB to restore coherency. Concurrently, the OS directs remote cores that might house m in their TLB to do the same, by sending them an inter-processor interrupt (IPI). The remote cores flush their TLBs according to the information supplied by the initiator core, and they report back when they are done. TLB shootdown can take microseconds, causing a notable slowdown [48]. Performing TLB shootdown in hardware, as certain CPUs do, is faster but still incurs considerable overheads [22].

In addition to reducing performance, shootdown overheads can negatively affect the way applications are constructed. Notably, to avoid shootdown latency, programmers are advised against using memory mappings, against unmapping them, and even against building multithreaded applications [28, 42]. But memory mappings are the efficient way to use persistent memory [18, 47], and avoiding unmappings might cause corruption of persistent data [12].

OSes try to cope with shootdown overheads by batching them [21, 43], avoiding them on idle cores, or, when possible, performing them faster [5]. But the potential of these existing solutions is inherently limited to certain specific scenarios. To have a generally applicable, efficient solution, OSes need to know which mappings are cached by which cores. Such information can in principle be obtained by replicating the translation data structures for each core [11], but this approach might result in significantly degraded performance and wasted memory.

We propose to avoid unwarranted TLB shootdowns in a different manner: by monitoring access bits. While TLB coherency is not maintained by the CPU, CPU architectures can maintain the consistency of access bits, which are set when a mapping is cached. We contend that these bits can therefore be used to reveal which mappings are cached by which cores. To our knowledge, we are the first to use access bits in this way.

In the x86 architecture, which we study in this paper, access bit consistency is maintained by the memory subsystem. Exploiting it, we propose techniques to identify two types of common mappings whose shootdown can be avoided: (1) short-lived private mappings, which are only cached by a single core; and (2) long-lived idle mappings, which are reclaimed after the corresponding pages have not been used for a while and are not cached at all. Using



these techniques, we implement a fully functional prototype in Linux 4.5. Our evaluation shows that our proposal can eliminate more than 90% of TLB shootdowns and improve the performance of memory migration by 78%, of copy-on-write events by 18–25%, and of multithreaded applications (Apache and parallel bzip2) by up to 12%.

Our system introduces a worst-case slowdown of up to 9% when mappings are only set and never removed or changed, which means no shootdown activity is conducted. This slowdown is caused, according to our measurements, by the overhead of our software TLB manipulation techniques. To eliminate it, we propose a CPU extension that would allow OSes to write entries directly into the TLB, resembling the functionality provided by CPUs that employ software-managed TLBs.

2. Background and Motivation

2.1 Memory Management Hardware

Virtual memory is supported by most modern CPUs and used by all the major OSes [9, 32]. Using virtual memory allows the OS to utilize the physical memory more efficiently and to isolate the address space of each process. The CPU translates virtual addresses to physical addresses before memory accesses are performed. The OS sets the virtual address translations (also called "mappings") according to its policies and considerations.

The memory mappings of each address space are kept in a memory-resident data structure, which is defined by the CPU architecture. The most common data structure, used by the x86 architecture, is a radix-tree, also known as a page-table hierarchy. The leaves of the tree, called page-table entries (PTEs), hold the translations of fixed-sized virtual memory pages to physical frames. To translate a virtual address into a physical address, the CPU incorporates a memory management unit (MMU), which performs a "page table walk" on the page-table hierarchy, checking access permissions at every level. During a page-walk, the MMU updates the status bits in each PTE, indicating whether the page was read from and/or written to (dirtied).

To avoid frequent page-table walks and their associated latency, the MMU caches translations of recently used pages in a translation lookaside buffer (TLB). In the x86 architecture, these caches are maintained by the hardware, which brings translations into the cache after page walks and evicts them according to an implementation-specific cache replacement policy. Each x86 core holds a logically private TLB. Unlike memory caches, TLBs of different CPUs are not kept coherent by hardware. Specifically, x86 CPUs do not maintain coherence between the TLB and the page tables, nor among the TLBs of different cores. As a result, page-table changes may leave stale entries in the TLBs until coherence is restored by the OS. The instruction set enables the OS to do so by flushing ("invalidating") individual PTEs or the entire TLB. Global and individual TLB flushes can only be performed locally, on the TLB of the core that executes the flush instruction.

Although the TLB is essential to attain reasonable translation latency, some workloads experience frequent TLB misses [4]. Recently, new features were introduced into the x86 architecture to reduce the number and latency of TLB misses. A new instruction set extension allows each page-table hierarchy to be associated with an address-space ID (ASID), avoiding TLB flushes during address-space switching and thus reducing the number of TLB misses. Microarchitectural enhancements introduced page-walk caches that enable the hardware to cache internal nodes of the page-table hierarchy, thereby reducing TLB-miss latencies [3].

2.2 TLB Software Challenges

The x86 architecture leaves maintaining TLB coherency to the OSes, which often requires frequent TLB invalidations after PTE changes. OS kernels can make such PTE changes independently of the running processes, upon memory migration across NUMA nodes [2], memory deduplication [49], memory reclamation, and memory compaction for accommodating huge pages [14]. Processes can also trigger PTE changes by using system calls, for example mprotect, which changes protection on a memory range, or by writing to copy-on-write (COW) pages.

These PTE changes can require a TLB flush to avoid caching of stale PTEs in the TLB. We distinguish between two types of flushes, local and remote, in accordance with the core that initiated the PTE change. Remote TLB flushes are significantly more expensive, since most CPUs cannot flush remote TLBs directly. OSes therefore perform a TLB shootdown: the initiating core sends an inter-processor interrupt (IPI) to the remote cores and waits for their interrupt handlers to invalidate their TLBs and acknowledge that they are done.

TLB shootdowns introduce a variety of overheads. IPI delivery can take several hundreds of cycles [5]. Then, the IPI may be kept pending if the remote core has interrupts disabled, for instance while running a device driver [13]. The x86 architecture does not allow OSes to flush multiple PTEs efficiently, requiring the OS to either incur the overhead of multiple flushes or flush the entire TLB and increase the TLB miss rate. In addition, TLB flushes may



indirectly cause lock contention, since they are often performed while the OS holds a lock [11, 15]. It is noteworthy that while some CPU architectures (e.g., ARM) make it possible to perform remote TLB shootdowns without IPIs, remote shootdowns still incur higher performance overhead than local ones [22].

2.3 OS Solutions and Shortcomings

To reduce TLB-related overheads, OSes employ several techniques to avoid unnecessary shootdowns, reduce their time, and avoid TLB misses.

A TLB shootdown can be avoided if the OS can ensure that the modified PTE is either not cached in remote TLBs or can be flushed at a later time, but before it can be used for an address translation. In practice, OSes can only avoid remote shootdowns in certain cases. In Linux, for example, each userspace PTE is only set in a single address-space page-table hierarchy, allowing the OS to track which address space is active on each core and flush only the TLBs of cores that currently use this address space. The TLB can be flushed during a context switch, before any stale entry would be used.

A common method to reduce shootdown time is to batch TLB invalidations if they can be deferred [21, 47]. Batching, however, cannot be used in many cases, for example when a multithreaded application changes the access permissions of a single page. Another way to reduce shootdown overhead is to acknowledge its IPI immediately, even before the invalidation is performed [5, 43].

Flush time can be reduced by lowering the number of TLB flushes. Flushing multiple individual PTEs is expensive, and therefore OSes may prefer to flush the entire TLB if the number of PTEs exceeds a certain threshold. This is a delicate trade-off, as such a flush increases the number of TLB misses [23].

Linux tries to balance the overheads of TLB flushes and TLB misses when a core becomes idle, using a lazy TLB invalidation scheme. Since the process that ran before the core became idle may be scheduled to run again, the OS does not switch its address space, in order to avoid potential future TLB misses. However, when the first TLB shootdown is delivered to the idle core, the OS performs a full TLB invalidation and indicates to the other cores not to send it further shootdown IPIs while it is idle.

Despite all of these techniques, shootdowns can induce high overheads in real systems. Arguably, this overhead is one of the reasons people refrain from using multithreading, in which mapping changes need to propagate to all threads. Moreover, application writers often prefer copying data over memory remapping, which requires TLB shootdown [42].

2.4 Per-Core Page Tables

Currently, the state-of-the-art software solution for TLB shootdowns is to set per-core page tables and, according to the experienced page-faults, track which cores used each PTE [11, 19]. When a PTE invalidation is needed, a shootdown is sent only to cores whose page tables hold the invalidated PTE.

Maintaining per-core page tables, however, can introduce substantial overheads when some PTEs are accessed by multiple cores. In such a case, OS memory management operations become more expensive, as mapping modifications require changes to the PTEs in multiple page tables. The overhead of PTE changes is not negligible, as some require atomic operations. RadixVM [11] reduces this overhead by changing PTEs in parallel: sending IPIs to cores that hold the PTE and changing them locally. This scheme is efficient when shootdowns are needed, as one IPI triggers both the PTE change and its invalidation. Yet, if a shootdown is not needed, for example when the other cores run a different process, this solution may increase the overhead due to the additional IPIs.

Holding per-core page tables can also introduce high memory overheads if memory is accessed by multiple cores. For example, on recent 288-core CPUs [24], if half of the memory is accessed by all cores, the page tables will consume 18% of the memory, or more if memory is overcommitted or mappings are sparse.

While studies showed substantial performance gains when per-core page tables are used, the limitations of this approach may not have been studied well enough. For example, in an experiment we conducted, memory migration between NUMA nodes was 5 times slower when memory was mapped in 48 page-table hierarchies (of 48 running Linux processes in our experiment) instead of one. Previous studies may not have shown these overheads, as they considered a teaching OS, which lacks basic memory management features [11]. In addition, previous studies experienced shootdown latencies of over 500k cycles, which is over 24x the latency that we measured. Presumably, the high overhead of shootdowns could overshadow other overheads.

3. The Idea

The challenge in reducing TLB shootdown overhead is determining which cores, if at all, might be caching a given PTE. Although architectural paging structures do not generally provide this information, we contend that the OS can nevertheless deduce it by carefully tracking and manipulating PTE access-bits. The proclaimed goal of access bits is to indicate if memory pages have been accessed. This functionality is declared by architectural manuals and is used by OSes to make informed swapping decisions. Our insight is that access bits can additionally be used for a different purpose: to indicate if PTEs are cached in TLBs, as explained next.

Let us assume: that (1) a PTE e might be cached by a set of cores S at time t0; that (2) e's access bit is clear at t0 (because it was never set, or because the OS explicitly cleared it); and that (3) this bit is still clear at some later time t1. Since access bits are set by hardware whenever it caches the corresponding translations in the TLB [25], we can safely conclude that e is not cached by any core c ∉ S at t1.

We note that our reasoning rests on the fact that last-level TLBs are private per core [6, 27, 29], and so translations are not transferred between them. Linux, for example, relies on this fact when shooting down a PTE of some address space α while avoiding the shootdown at remote cores whose current address spaces are different than α (§2.3). This optimization would have been erroneous if TLBs were shared, because Linux permits the said remote cores to freely load α while the shootdown takes place, which would have allowed them to cache stale mappings from a shared last-level TLB, thereby creating an inconsistency bug.

We identify two types of mappings that can help us optimize TLB shootdown by leveraging access-bit information. The first is short-lived private mappings of pages that are accessed exclusively by a single thread and then removed shortly after; this access pattern may be exhibited, for example, by multithreaded applications that use memory-mapped files to read data. The second type is long-lived idle mappings of pages that are reclaimed by the OS after they have not been accessed for a while; this pattern is typical for pages that cease to be part of the working set of a process, prompting the OS to unmap them, flush their PTEs, and reuse their frames elsewhere.

4. The System

Using the above reasoning (§3), we next describe the Linux enhancements we deploy on an x86 Intel machine to optimize TLB shootdown of short-lived private mappings (§4.1) and long-lived idle mappings (§4.2). We then describe "software-PTEs", the data structures we use when implementing our mechanisms (§4.3). To distinguish our enhancements from the baseline OS, we collectively denote them as ABIS—access-based invalidation system.

4.1 Private PTE Detection

To avoid TLB shootdown due to a private mapping, we must (1) identify the core that initially uses this mapping and (2) make sure that other cores have not used it too at a later time. As previously shown [27], the first item is achievable via demand paging, the standard memory management technique that OSes employ, which traps upon the first access to a memory page and only then sets a valid mapping [9]. The second item, however, is more challenging, as existing approaches to detect PTE sharing can introduce overheads that are much higher than those we set out to eliminate (§6).

Direct TLB Insertion

Our goal is therefore to find a low-overhead way to detect PTE sharing. As a first step, we note that this goal would have been easily achievable if it were possible to conduct direct TLB insertion—inserting a mapping m directly into the TLB of a core c without setting the access bit of the corresponding PTE e. Given such a capability, as long as m resides in the TLB, subsequent uses of m by c would not set the access-bit of e, as no page-table walks are needed. In contrast, if some other core c̄ ends up using m as well, the hardware will walk the page table when inserting m into the TLB of c̄, and it will therefore set e's access bit, thereby indicating that m is not private.

Direct TLB insertion would thus have allowed us to use turned-off access bits as identifiers of private mappings. We remark that this method is best-effort and might lead to false-positive indications of sharing in cases where m is evicted from the TLB and reinserted later. This issue does not affect correctness, however. It simply implies that some useless shootdown activity is possible. The approach is thus more suitable for short-lived PTEs.

Alas, current x86 processors do not support direct TLB insertion. One objective of this study is to motivate such support. When proposing a new hardware feature, architects typically resort to simulation, since it is unrealistic to fabricate chips to test research features. We do not employ simulation for two reasons. First, because we suspect that it might yield questionable results, as the OS memory management subsystems that are involved are too complex to realistically simulate. Second, TLB insertion is possible on existing hardware even without dedicated hardware support, and can benefit workloads that are sensitive to shootdown overheads, shortening runtimes by 0.56x (= 1/1.78; see Figure 5) at best. Although runtimes might be 1.09x longer in the worst case, our results indicate that real hardware support would eliminate this overhead (§5.1).

Note that although direct TLB insertion is not supported in the x86 architecture, it is supported in CPUs that employ software-managed TLBs. For example, Power CPUs support the tlbwe instruction



that can insert a PTE directly into the TLB. We therefore consider this enhancement achievable with a reasonable effort.

[Figure 1: Direct TLB insertion using a secondary hierarchy. The diagram shows a primary and a secondary page-table hierarchy (PGDs, PUDs, PMDs, PTEs) that share the same PCID but have different CR3 addresses, and five steps: (1) change CR3 to the secondary hierarchy (same PCID, different address); (2) wire both hierarchies, with the primary access bit at 0; (3) read the page, so the translation is loaded into the TLB from the secondary CR3 while the primary bit remains 0; (4) change CR3 back to the primary hierarchy; (5) zero the secondary PTE.]

Approximation

Let us first rule out the naive approach to approximate direct TLB insertion by: (1) setting a PTE e; (2) accessing the page and thus prompting hardware to load the corresponding mapping m into the TLB and to set e's access bit; and then (3) having the OS clear e's access bit. This approach is buggy due to the time window between the second and third items, which allows other cores to cache m in their TLBs before the bit is cleared, resulting in false indications that the page is private. Shootdown will then be erroneously skipped.

We resolve this problem and avoid the above race by using Intel's address-space IDs, known as process-context identifiers (PCIDs) [25]. PCIDs enable TLBs to hold mappings of multiple address spaces by associating every cached PTE with the PCID of its address space. The PCID of the current address space is stored in the same register as the pointer to the root of the page-table hierarchy (CR3), and TLB entries are associated with this PCID when they are cached. The CPU uses for address translation only PTEs whose PCID matches the current one. This feature is intended to allow OSes to avoid global TLB invalidations during context switches and to reduce the number of TLB misses.

PCID is not currently used by Linux due to the limited number of supported address spaces and questionable performance gains from TLB-miss reduction. We indeed exploit this feature in a different manner. Nevertheless, our use does not prevent or limit future PCID support in the OS.

The technique ABIS employs to provide direct TLB insertion is depicted in Figure 1. Upon initialization, ABIS preallocates for each core a "secondary" page-table hierarchy, which consists of four pages, one for each level of the hierarchy. The uppermost level of the page table (PGD) is then set to point to the kernel mappings (like all other address spaces). The other three pages are not connected to the hierarchy at this stage, but are wired dynamically later according to the address of the PTE that is inserted into the TLB.

While executing, the currently running thread T occasionally experiences page faults, notably due to demand paging. When a page fault fires, the OS handler is invoked and locks the page table that holds the faulting PTE—no other core will simultaneously handle the same fault.

At this point, ABIS loads the secondary space into CR3 along with a PCID equal to that of T (Step 1 in Figure 1). Afterwards, ABIS wires the virtual-to-physical mapping of the target page in both primary and secondary spaces, leaving the corresponding access bit in the primary hierarchy clear (Step 2).

Then, ABIS reads from the page. Because the associated mapping is currently missing from the TLB (a page fault fired), and because CR3 currently points to the secondary space, reading the page prompts the hardware to walk the secondary hierarchy and to insert the appropriate translation into the TLB, leaving the primary bit clear (Step 3). Importantly, the inserted translation is valid and usable within the primary space, because both spaces have the same PCID and point to the same physical page using the same virtual address. This approach eliminates the aforementioned race: no other core is able to access the secondary space, as it is private to the core.

After reading the page, ABIS loads the primary hierarchy back into CR3, to allow the thread to continue as usual (Step 4). It then clears the PTE from the secondary space, thereby preventing further use of translation data from the secondary hierarchy that may have been cached in the hardware page-walk cache (PWC). If the secondary tables are used by the CPU for translation, no valid PTE will be found and the CPU will restart a page-walk from the root entry.

Finally, using our "software-PTE" (SPTE) data structure (§4.3), ABIS associates the faulting PTE e with the current core c that has just resolved e. When the time comes to flush e, if ABIS determines that e is still private to c, it will invalidate e on c only, thus avoiding the shootdown overhead.

Coexisting with Linux

Linux reads and clears architectural access bits (hwA-s) via a small API, allowing us to easily mask these bits while making sure



that both Linux and ABIS simultaneously operate correctly. Notably, when Linux attempts to clear an hwA, ABIS (1) checks whether the bit is turned on, in which case it (2) clears the bit and (3) records in the SPTE the fact that the associated PTE is not private (using the value ALL_CPUS discussed further below). Note, however, that Linux and ABIS can use the access bit in a conflicting manner. For example, after a page fault, Linux could expect to see the access bit turned on, whereas ABIS's direct TLB insertion makes sure that the opposite happens. To avoid any such conflicts, we maintain in the SPTE a new per-PTE "software access bit" (swA) for Linux, which reflects Linux's expectations. The swA bits are governed by the following rules: upon a page fault, we set the swA; when Linux clears the bit, we clear the swA; and when Linux queries the bit, we return an OR'd value of swA and hwA. These rules ensure that Linux always observes the values it would have observed in an ABIS-less system.

ABIS attempts to reduce false indications of PTE sharing when possible. We find that Linux performs excessive full flushes to reduce the number of IPIs sent to idle cores as part of the shootdown procedure (§2.3). In Linux, this behavior is beneficial, as it reduces the number of TLB shootdowns at the cost of more TLB misses, whose impact is relatively small. In our system, however, this behavior can result in more shootdowns, as it increases the number of false indications. ABIS therefore relaxes this behavior, allowing idle cores to service a few individual PTE flushes before resorting to a full TLB flush.

Overhead

Overall, the overhead of direct TLB insertions in our system is ≈550 cycles per PTE (responsible for the worst-case 9% slowdown mentioned earlier). This overhead is amortized when multiple PTEs are mapped together, for example via one mmap system-call invocation, or when Linux serves a page-fault on a file-backed page and maps adjacent PTEs to avoid future page-faults [36].

4.2 TLB Version Tracking

[Figure 2: A finite state machine that describes the various states of a PTE: "uncached" (ver=uncached), "faulted-in private" (CPU=[current], ver=[AS].ver), and "potentially shared" (CPU=all, ver=[AS].ver). PTE flushes and PTE changes while the access-bit is set move the PTE between states. In each state, the assignment of the caching core and version are denoted. On each transition the access-bit is cleared.]

Based on our observations from §3, we build a TLB version tracking mechanism to avoid flushes of long-lived idle mappings. Let us assume that a PTE e might be cached by a set of cores S at time t0, and that each core c ∈ S performed a full TLB flush during the time period (t0, t1). If at time t1 the access bit of e remains clear (i.e., was not cleared by software), then we know for a fact that e is not cached by any TLB. If the OS obtained the latter information by atomically reading and zeroing e, then all TLB flushes associated with e (local and remote) can be avoided. To detect such cases, we first need to maintain a "full-flush version number" for S, such that the version is incremented whenever all cores c ∈ S perform a full TLB flush. Recording this version for each e at the time e is updated would then allow us to employ the optimization.

TLB version tracking

The most accurate way to track full flushes is by maintaining a version for each core, advancing it after each local full flush, and storing a vector of the versions for every PTE. Then, if a certain core's version differs from the corresponding vector coordinate (and the access-bit is clear), a flush on that core is not required. Despite its accuracy, this scheme is impractical, as it consumes excessive memory and requires multiple memory accesses to update version vectors. We therefore trade off accuracy in order to reduce the memory consumption of versions and the overheads of updating them.

ABIS therefore tracks versions for each address space (AS, corresponding to the above S) and not for each core. To this end, for every AS, we save a version number and a bitmask that marks which cores have not performed a full TLB flush in the current version. The last core to perform a full TLB flush in a certain version advances the version. At the same time, it marks in the bitmask which cores currently use this AS and can therefore cache PTEs in the next version. To mitigate cache-line bouncing, the core that initiates a TLB shootdown updates the version on behalf of the target cores.

Avoiding flushes

After a PTE access-bit is cleared, ABIS stores the current AS version as the PTE version. Determining later whether a shootdown is needed requires some attention, as even if the PTE and the AS versions differ, a flush may be necessary. Consider



Figure 3: Flush type decision algorithm.

Figure 4: Software PTE (SPTE) and its association to the page table through the meta-data of the page-table frame.

a situation in which the access-bit is cleared, and the PTE version is updated to hold the AS version. At this time, some of the cores may have already flushed their TLB for the current AS version, and their respective bit in the bitmask is clear. The AS version may therefore advance before these cores flush their TLB again, and these cores can hold stale PTEs even when the versions differ. Thus, our system avoids a shootdown only if there is a gap of at least one version between the AS and the PTE versions, which indicates a flush was performed on all cores.

Since flushes cannot be avoided when the access-bit is set, this bit should be cleared and the PTE version updated as frequently as possible, assuming it introduces negligible overheads. In practice, ABIS clears the bit and updates the version whenever the OS already accesses a PTE for other purposes, for example during an mprotect system-call or when the OS considers a page for reclamation.

Uncached PTEs The version tracking mechanism can also prevent unwarranted multiple flushes of the same PTE. Such flushes may occur, for example, when a user first calls an msync system call, which performs writeback of a memory mapped file, and then unmaps the file. Both operations require flushing the TLB, since the first clears the PTEs' dirty-bit and the second sets a non-present PTE. However, if the PTE was not accessed after the first flush, the second flush is unnecessary, regardless of whether a full TLB flush happened in between. To avoid this scenario, we set a special version value, UNCACHED, as the PTE version when it is flushed. This value indicates that the PTE is not cached in any TLB if the access-bit is cleared, regardless of the current AS version.

Coexisting with Private PTE Detection Version tracking coexists with private PTE detection. The interaction between the two can be described in a state machine, as shown in Figure 2. In the "uncached" state a TLB flush is unnecessary; in the "private" state at most one CPU needs to perform a TLB flush; and in the "potentially shared" state all the CPUs perform a TLB flush.¹ In the latter two states, a TLB flush may still be avoided if the access-bit is clear and the current address space version is at least two versions ahead of the PTE version. Figure 3 shows the ABIS flush decision algorithm.

¹ A TLB flush is not required on CPUs that currently use a different page-table hierarchy, as explained in §2.

4.3 Software PTEs

As we noted before, for our system to perform informed TLB invalidation decisions, additional information must be saved for each PTE: the PTE version, the CPU which caches the PTE, and a software access-bit. Although we are capable of squeezing this information into two bytes, the architectural PTE only accommodates three bits for software use. We therefore allocate a separate "software page-table" (SPT) for each PT, which holds the corresponding "software-PTEs" (SPTEs). The SPTE is not used by the CPU during page-walks and therefore causes little cache pollution and overhead.

An SPTE is depicted in Figure 4. We use 7 bits for the version, 1 bit for the software access-bit, and another byte to track the core that caches the PTE if the access-bit is cleared. We want to define the SPTE in a manner that ensures a zeroed SPTE would behave in the legacy manner, allowing us to make fewer code changes. To do so, we reserve the zero value of the "caching core" field to indicate that the PTE may be cached by all CPUs (ALL_CPUS) and instead store the core number plus one.

When the OS wishes to access the SPTE of a certain PTE, it should be able to easily access it. Yet the PTE cannot accommodate a pointer to its SPTE.



A possible solution is to allocate two page-frames for each page-table, one holding the CPU architectural PTEs and the second holding the corresponding SPTEs, each in a fixed offset from its PTE. While this scheme is simple, it wastes memory, as it requires the SPTE to be the same size as a PTE (8B), when in fact an SPTE only occupies two bytes.

We therefore allocate an SPT separately during the PT construction, and set a pointer to the SPT in the PT page-frame meta-data (page struct). Linux can quickly retrieve this meta-data, allowing us to access the SPTE of a certain PTE with small overhead. The SPTE pointer does not increase the page-frame meta-data, as it is set in an unused PT meta-data field (second quadword). The SPT therefore increases page table memory consumption by 25%. ABIS prevents races during SPT changes by protecting it with the same lock that is used to protect PT changes. It is noteworthy that although SPT management introduces an overhead, it is negligible relative to the other overheads in the workloads we evaluated.

5. Evaluation

We implemented a fully-functional prototype of the system, ABIS, which is based on Linux 4.5. As a baseline system for comparison we use the same version of Linux, which includes recent TLB shootdown optimizations. We run each test 5 times and report the average result. Our testbed consists of a two-socket Dell PowerEdge R630 with Intel 24-core Haswell EP CPUs. We enable x2APIC cluster-mode, which speeds up IPI delivery.

In our system we disable transparent huge pages (THP), which may cause frequent full TLB flushes, increase the TLB miss-rate [4], and introduce additional overheads [26]. In practice, when THP is enabled, ABIS still shows benefit when small pages are used (e.g., in the Apache benchmark shown later) and no impact when huge pages are used (e.g., PBZIP2).

As a fast block device for our experiments we use ZRAM, a compressed RAM block device, which is used by Google Chrome OS and Ubuntu. This device's latency is similar to that of emerging non-volatile memory modules. In our tests, we disable memory deduplication and deep sleep states, which may increase the variance of the results.

Figure 5: Normalized runtime and number of TLB shootdowns in ABIS when running vm-scalability benchmarks. The numbers above the bars indicate the baseline (left) runtime in seconds and (right) rate of TLB shootdowns in thousands/second.

5.1 VM-Scalability

We use the vm-scalability test suite [34], which is used by Linux kernel developers to exercise the kernel memory management mechanisms, test their correctness, and measure their performance.

We measure ABIS performance by running benchmarks that experience a high number of TLB shootdowns.² To run the benchmarks in a reasonable time, we limit the amount of memory each test consumes to 32GB. Figure 5 presents the measured speedup, the runtime, the relative number of sent TLB shootdowns, and their rate. We now discuss these results.

Migrate. This benchmark reads a memory mapped file and waits while the OS is instructed to migrate the process memory between NUMA nodes. During migration, we set the benchmark to perform a busy-wait loop to exercise TLB flushes. We present the time that a 1TB memory migration would take. ABIS reduces runtime by 44% and shootdowns by 92%.

Multithreaded copy-on-write (cow-mt). Multiple threads read and write a private memory mapped file. Each write causes the kernel to copy the original page, update the PTE to point to the copy, and flush the TLB. ABIS prevents over 97% of the shootdowns, reducing runtime by 20% for sequential memory accesses and by 15% for random accesses.

Memory mapped reads (mmap-read). Multiple processes read a big sparse memory mapped file. As a result, memory pressure builds up, and memory is reclaimed. While almost all the shootdowns are eliminated, the runtime is not affected, as apparently there are more significant overheads, specifically those of the page frame reclamation algorithm.

Multithreaded msync (msync-mt). Multiple threads access a memory mapped file and call the msync system-call to flush the memory changes to the file. msync can cause an overwhelming number of flushes, as the OS clears the dirty-bit. ABIS eliminates 98% of the shootdowns but does not reduce the runtime, as file system overhead appears to be the main performance bottleneck.

² We find that some of these benchmarks practice unrealistic scenarios. Our revised tests are released with the ABIS code.



Figure 6: Execution of an Apache web server which serves the Wrk workload generator. (a) Throughput. (b) TLB shootdowns.

Anonymous memory read (anon-r-seq). To evaluate ABIS overheads we run a benchmark that performs sequential anonymous memory reads and does not cause TLB shootdowns. This benchmark's runtime is 9% longer using ABIS. Profiling the benchmark shows that the software TLB manipulations consume 9% of the runtime, suggesting that hardware enhancements to manipulate the TLB can eliminate most of the overheads.

5.2 Apache Web Server

Apache is the most widely used web server software. In our tests, we use Apache v2.4.18 and enable buffered server logging for more efficient disk accesses. We use the multithreaded Wrk workload generator to create web requests [50], and set it to repeatedly request the default Apache web page for 30 seconds, using 400 connections and 6 threads. We use the same server for both the generator and Apache, and isolate each one on a set of cores. We ensure that the generator is unaffected by ABIS.

Apache provides several multi-processing modules. We use the default "mpm_event" module, which spawns multiple processes, each of which runs multiple threads. Apache serves each request by creating a memory mapping of the requested file, sending its content, and unmapping it. This behavior effectively causes frequent invalidations of short-lived mappings. In the baseline system, the invalidation also requires expensive TLB shootdowns to the cores that run other threads of the Apache process. Effectively, when Apache serves concurrent requests using multiple threads, it triggers a TLB shootdown for each request that it serves.

Figure 6a depicts the number of requests per second that are served when the server runs on different numbers of cores. ABIS improves performance by 12% when all cores are used. Executing the benchmark reveals that the effect of ABIS on performance is inconsistent when the number of cores is low, as ABIS causes slowdowns of up to 8% and speedups of up to 42%. Figure 6b presents the number of TLB shootdowns that are sent and received in the baseline system and in ABIS. As shown, in the baseline system, as more cores are used, the number of sent TLB shootdowns becomes almost identical to the number of requests that Apache serves. ABIS reduces the number of both sent and received shootdowns by up to 90%, as it identifies that PTEs are private and that local invalidation would suffice.

5.3 PBZIP2

Parallel bzip2 (PBZIP2) is a multithreaded implementation of the bzip2 file compressor [20]. In this benchmark we evaluate the effect of reclamation due to memory pressure on PBZIP2, which in itself does not cause many TLB flushes. We use PBZIP2 to compress the Linux 4.4 tar file. We configured the benchmark to read the input file into RAM and split it between processors using a 500k block size. We run PBZIP2 in a container and limit its memory to 300MB to induce swap activity. This activity causes the invalidation of long-lived idle mappings as inactive memory is reclaimed.

The compression time is shown in Figure 7a. ABIS outperforms Linux by up to 12%, and the speedup grows with the number of cores. Figure 7b presents the number of TLB shootdowns per second when this benchmark runs. The baseline Linux system sends nearly 200k shootdowns regardless of the number of threads, and the different shootdown send rate is merely due to the shorter runtime when the number of cores is higher. The number of received shootdowns in the baseline system is



Figure 7: Execution of PBZIP2 when compressing the Linux kernel. The process memory is limited to 300MB to exercise page reclamation. (a) Runtime. (b) Remote shootdowns.

proportional to the number of cores, as the OS cannot determine which TLBs cache the entry, and broadcasts the shootdown messages to all the cores that run the process threads. In contrast, ABIS can usually determine that a single TLB needs to be flushed. When 48 threads are spawned, a shootdown is sent on average to 10 remote cores in ABIS, and to 18 cores using baseline Linux.

5.4 PARSEC Benchmark Suite

We run the PARSEC 3.0 benchmark suite [7], which is composed of multithreaded applications that are intended to represent emerging shared-memory programs. We set up the benchmark suite to use the native dataset and spawn 32 threads. The measured speedup, the runtime, the normalized number of TLB shootdowns, and their rate in the baseline system are presented in Figure 8. As shown, ABIS can improve performance by over 3% but can also induce overheads of up to 2.5%. ABIS reduces the number of TLB shootdowns by 96% on average.

The benefit of ABIS appears to be limited by the overhead of the software technique it uses to insert PTEs into the TLB. As this overhead is incurred after each page fault, workloads that trigger considerably more page faults than TLB shootdowns experience slowdown. For example, the "canneal" benchmark causes 1.5k TLB shootdowns per second in the baseline system, and ABIS prevents 91% of them. However, since the benchmark triggers over 55k page-faults per second, ABIS reduces performance by 2.5%. In contrast, "dedup" triggers 33k shootdowns and 370k page faults per second. ABIS saves 39% of the shootdowns and improves performance by 3%. Hardware enhancements or selective enabling of ABIS could prevent the overheads.

5.5 Limitations

ABIS is not free of limitations. The additional operations and data introduce performance and memory overheads, specifically the insertion of PTEs into the TLB without setting the access-bit. However, relatively simple hardware enhancements could eliminate most of the overhead (§7). In addition, the CPU incurs an overhead of roughly 600 cycles when it sets the access-bit of shared PTEs [37].

To detect short-lived private mappings, our system requires that the TLB be able to accommodate them during their lifetime. New CPUs include rather large TLBs of up to 1536 entries, which may map 6MB of memory. However, non-contiguous or very large working sets may cause TLB pressure, induce evictions, and cause false indications that PTEs are shared. In addition, frequent full TLB flushes, for instance during address-space switching or when the OS sets the CPU to enter a deep sleep-state, have similar implications. Process migration between cores is also damaging, as it causes PTEs to be shared between cores and requires shootdowns. These limitations are often irrelevant to a well-tuned system [30, 31].

Finally, our system relies on the micro-architectural behavior of the TLBs. We assume the MMU does not perform involuntary flushes and that the same PTE is not marked as "accessed" multiple times when it is already cached. Experimentally, this is not always the case. We further discuss these limitations in §7.

6. Related Work

Hardware Solutions. The easiest solution from a software point of view is to maintain TLB coherency in hardware. DiDi uses a shared second-level TLB directory that tracks which PTEs are cached by which



Figure 8: Normalized runtime and number of TLB shootdowns in ABIS when running PARSEC benchmarks. The numbers above the bars indicate the baseline (left) runtime in seconds and (right) rate of TLB shootdowns in thousands/second.

core and performs TLB flushes accordingly [48]. Teller et al. proposed that OSes save a version count for each PTE, to be used by hardware to perform TLB invalidations only when memory is addressed via a stale TLB entry [39]. Li et al. eliminate unwarranted shootdowns of PTEs that are only used by a single core by extending PTEs to accommodate the core that first accessed a page, enhancing the CPU to track whether a PTE is private, and avoiding shootdowns accordingly [27].

These studies present compelling evaluation results; however, they require intrusive micro-architecture changes, which CPU vendors are apparently reluctant to introduce, presumably due to a history of TLB bugs [1, 16, 17, 35, 46].

Software Solutions. To avoid unnecessary recurring TLB flushes of invalidated PTEs, Uhlig tracks TLB versions and avoids shootdowns when the remote cores have already performed full TLB flushes after the PTE changed [43, 44]. However, the potential of this approach is limited, since even when TLB invalidations are batched, the TLB is flushed shortly after the last PTE is modified.

An alternative approach for reducing TLB flushes is to require applications to inform the OS how memory is used, or to control TLB flushes explicitly. Corey OS avoids TLB shootdowns of private PTEs by requiring that user applications define which memory ranges are private and which are shared [10]. C4 uses an enhanced Linux version that allows applications to control TLB invalidations [40]. These systems, however, place an additional burden on application writers. Finally, we should note that reducing the number of memory mapping changes, for example by improving the memory reclamation policy, can reduce the number of TLB flushes. However, these solutions are often workload dependent [45].

7. Hardware Support

Although our system saves most of the TLB shootdowns, it does introduce some overheads. Hardware support that would allow privileged OSes to insert PTEs directly into the TLB without setting the access-bit would eliminate most of ABIS's overhead. Such an enhancement should be easy to implement, as we achieve an equivalent behavior in software.

ABIS would be able to save even more TLB flushes if CPUs avoided setting the PTE access-bit after the PTE is cached in the TLBs. We encountered, however, situations where such events occur. It appears that when Intel CPUs set the PTE dirty-bit due to a write access, they also set the access-bit, even if the PTE is already cached in the TLB. Similarly, before a CPU triggers a page-fault, it performs a page-walk to retrieve the updated PTE from memory and may set the access-bit even if the PTE disallows access. Since x86 CPUs invalidate the PTE immediately after, before invoking the page-fault exception handler, setting the access-bit is unnecessary.

CPUs should not invalidate the TLB unnecessarily, as such invalidations hurt performance regardless of ABIS. ABIS is further affected, as these invalidations cause the access-bit to be set again when the CPU re-caches the PTE. We found that Intel CPUs (unlike AMD CPUs) may perform full TLB flushes when virtual machines invalidate huge pages that are backed by small host pages.

8. Conclusion

We have presented two new software techniques that prevent TLB shootdowns in common cases, without replicating the mapping structures and without incurring more page-faults. We have shown their benefits in a variety of workloads. While our system introduces overheads in certain cases, these can be reduced by minor CPU enhancements. Our study suggests that providing OSes better control over TLBs may be an efficient and simple way to reduce TLB coherency overheads.

Availability

The source code is publicly available at: http://nadav.amit.to/publications/tlb.

Acknowledgment

This work could not have been done without the continued support of Dan Tsafrir and Assaf Schuster. I also thank the paper reviewers and the shepherd, Jean-Pierre Lozi.



References

[1] Lukasz Anaczkowski. Linux VM workaround for Knights Landing A/D leak. Linux Kernel Mailing List, lkml.org/lkml/2016/6/14/505, 2016.

[2] Manu Awasthi, David W Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis. Handling the problems and opportunities posed by multiple on-chip memory controllers. In ACM/IEEE International Conference on Parallel Architecture & Compilation Techniques (PACT), pages 319–330, 2010.

[3] Thomas W Barr, Alan L Cox, and Scott Rixner. Translation caching: skip, don't walk (the page table). In ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 48–59, 2010.

[4] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D Hill, and Michael M Swift. Efficient virtual memory for big memory servers. In ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 237–248, 2013.

[5] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In ACM Symposium on Operating Systems Principles (SOSP), pages 29–44, 2009.

[6] Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. Shared last-level TLBs for chip multiprocessors. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 62–63, 2011.

[7] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In ACM/IEEE International Conference on Parallel Architecture & Compilation Techniques (PACT), pages 72–81, 2008.

[8] David L Black, Richard F Rashid, David B Golub, Charles R Hill, and Robert V Baron. Translation lookaside buffer consistency: a software approach. In ACM Architectural Support for Programming Languages & Operating Systems (ASPLOS), pages 113–122, 1989.

[9] Daniel Bovet and Marco Cesati. Understanding the Linux Kernel, Third Edition. O'Reilly & Associates, Inc., 2005.

[10] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, M Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yue-hua Dai, et al. Corey: An operating system for many cores. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), pages 43–57, 2008.

[11] Austin T Clements, M Frans Kaashoek, and Nickolai Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In ACM SIGOPS European Conference on Computer Systems (EuroSys), pages 211–224, 2013.

[12] Joel Coburn, Adrian M Caulfield, Ameen Akel, Laura M Grupp, Rajesh K Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In ACM Architectural Support for Programming Languages & Operating Systems (ASPLOS), pages 105–118, 2011.

[13] Jonathan Corbet. Realtime and interrupt latency. LWN.net, https://lwn.net/Articles/139784/, 2005.

[14] Jonathan Corbet. Memory compaction. LWN.net, https://lwn.net/Articles/368869/, 2010.

[15] Jonathan Corbet. Memory management locking. LWN.net, https://lwn.net/Articles/591978/, 2014.

[16] Christopher Covington. arm64: Work around Falkor erratum 1003. Linux Kernel Mailing List, https://lkml.org/lkml/2016/12/29/267, 2016.

[17] Linux Kernel Driver DataBase. CONFIG_ARM_ERRATA_720789. http://cateee.net/lkddb/web-lkddb/ARM_ERRATA_720789.html, 2012.

[18] Jake Edge. Persistent memory. LWN.net, https://lwn.net/Articles/591779/, 2014.

[19] Balazs Gerofi, Akira Shimada, Atsushi Hori, and Yozo Ishikawa. Partially separated page tables for efficient operating system assisted hierarchical memory management on heterogeneous architectures. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 360–368, 2013.

[20] Jeff Gilchrist. Parallel data compression with bzip2. In IASTED International Conference on Parallel and Distributed Computing and Systems (ICPDCS), volume 16, pages 559–564, 2004.

[21] Mel Gorman. TLB flush multiple pages per IPI v4. Linux Kernel Mailing List, https://lkml.org/lkml/2015/4/25/125, 2015.

[22] Julien Grall. Force broadcast of TLB and instruction cache maintenance instructions. Xen development mailing list, https://patchwork.kernel.org/patch/8955801/, 2016.

[23] Dave Hansen. Patch: x86: set TLB flush tunable to sane value. https://patchwork.kernel.org/patch/4460841/, 2014.

[24] Xeon Phi processor. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html.

[25] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual. Reference number: 325462-057US, 2015. https://software.intel.com/en-us/articles/intel-sdm.

[26] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), pages 705–721, 2016.

[27] Yong Li, Rami Melhem, and Alex K Jones. PS-TLB: Leveraging page classification information for fast, scalable and efficient translation for future CMPs. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):28, 2013.

[28] Likai Liu. Parallel computing and the cost of TLB shoot-down. http://lifecs.likai.org/2010/06/parallel-computing-and-cost-of-tlb.html, 2010.

[29] Daniel Lustig, Abhishek Bhattacharjee, and Margaret Martonosi. TLB improvements for chip multiprocessors: Inter-core cooperative prefetchers and shared last-level TLBs. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), 2013.

[30] Ophir Maor. What is CPU affinity? https://community.mellanox.com/docs/DOC-1924, 2014.

[31] Ophir Maor. Mellanox BIOS performance tuning example. https://community.mellanox.com/docs/DOC-2297, 2015.

[32] Marshall Kirk McKusick, George V Neville-Neil, and Robert NM Watson. The design and implementation of the FreeBSD operating system. Pearson Education, 2014.

[33] Jinzhan Peng, Guei-Yuan Lueh, Gansha Wu, Xiaogang Gou, and Ryan Rakvic. A comprehensive study of hardware/software approaches to improve TLB performance for Java applications on embedded systems. In ACM Workshop on Memory System Performance and Correctness (MSPC), pages 102–111, 2006.

[34] Aristeu Rozanski. VM-scalability benchmark suite. https://github.com/aristeu/vm-scalability, 2010.

[35] Anand Lal Shimpi. AMD's B3 stepping Phenom previewed, TLB hardware fix tested. AnandTech, http://www.anandtech.com/show/2477/2, 2008.

[36] Kirill A. Shutemov. mm: map few pages around fault address if they are in page cache. Linux Kernel Mailing List, https://lwn.net/Articles/588802, 2014.

[37] Kirill A. Shutemov. unixbench.score -6.3% regression. Linux Kernel Mailing List, http://lkml.kernel.org/r/20160613125248.[email protected], 2016.

[38] Patricia J Teller. Translation-lookaside buffer consistency. IEEE Computer, 23(6):26–36, June 1990.

[39] Patricia J Teller, Richard Kenner, and Marc Snir. TLB consistency on highly-parallel shared-memory multiprocessors. Courant Inst. of Math. Sci, 1987.

[40] Gil Tene, Balaji Iyengar, and Michael Wolf. C4: The continuously concurrent compacting collector. ACM International Symposium on Memory Management (ISMM), pages 79–88, 2011.

[41] Michael Y Thompson, JM Barton, TA Jermoluk, and JC Wagner. Translation lookaside buffer synchronization in a multiprocessor system. In USENIX Winter, pages 297–302, 1988.

[42] Linus Torvalds. Splice: fix race with page invalidation. http://yarchive.net/comp/linux/zero-copy.html, 2008.

[43] Volkmar Uhlig. Scalability of microkernel-based systems. PhD thesis, TH Karlsruhe, 2005. https://os.itec.kit.edu/downloads/publ_2005_uhlig_scalability_phd-thesis.pdf.

[44] Volkmar Uhlig. The mechanics of in-kernel synchronization for a scalable microkernel. ACM SIGOPS Operating Systems Review (OSR), 41(4):49–58, 2007.

[45] Ahsen J Uppal and Mitesh R Meswani. Towards workload-aware page cache replacement policies for hybrid memories. In International Symposium on Memory Systems (MEMSYS), pages 206–219, 2015.

[46] Theo Valich. Intel explains the Core 2 CPU errata. The Inquirer, http://www.theinquirer.net/inquirer/news/1031406/intel-explains-core-cpu-errata, 2007.

[47] Brian Van Essen, Henry Hsieh, Sasha Ames, and Maya Gokhale. DI-MMAP: A high performance memory-map runtime for data-intensive applications. In IEEE International Workshop on Data-Intensive Scalable Computing Systems (SCC), pages 731–735, 2012.

[48] Carlos Villavieja, Vasileios Karakostas, Lluis Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrian Cristal, and Osman S Unsal. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In ACM/IEEE International Conference on Parallel Architecture & Compilation Techniques (PACT), pages 340–349, 2011.

[49] Carl A. Waldspurger. Memory resource management in VMware ESX server. In USENIX Symposium on Operating Systems Design & Implementation (OSDI), volume 36, pages 181–194, 2002.

[50] wrk. HTTP benchmarking tool. https://github.com/wg/wrk, 2015.
