DPTree - Differential Indexing For Persistent Memory
key-value entries could be inserted into a leaf node in PM with only one update to the metadata. Based on this idea, we design a two-level persistent index partly inspired by the dual-stage index architecture [32]. The first level (called the buffer tree) resides in DRAM and the second level (called the base tree) resides in PM. Writes are first absorbed by the fast and small buffer tree, which is backed by a write-optimized PM redo log for durability. When the buffer tree reaches a size threshold, we exploit the byte-addressability of PM to merge the buffer tree into the base tree in place, and the changes are then flushed to PM with little fencing overhead. In this way, the SMO (structure modification operation) overhead is amortized over a batch of updates. In addition, the base tree is heavily read-optimized to reduce the search latency introduced by the two-level design. Experiments show that DPTree reduces PM writes by a factor of 1.7x-3x compared to state-of-the-art counterparts while maintaining comparable or better search latency. We summarize the contributions as follows:

• We present DPTree, which consists of the following designs: 1) a write-optimized adaptive PM log for the buffer tree; 2) a novel coarse-grained versioning technique for the base tree that provides crash consistency while enabling concurrent reads; 3) an in-place crash-consistent merge algorithm based on the versioning technique that reduces writes; 4) a hashing-based in-PM leaf node design that reduces PM reads while preserving efficient range scans.

• Apart from the single-threaded DPTree, we propose additional designs to make DPTree concurrent and scalable in terms of read, write, parallel merge, and parallel recovery.

• We show that DPTree achieves superior read and write performance compared to state-of-the-art structures in our experiments on realistic workloads.

The rest of the paper is organized as follows: Section 2 details the design of DPTree. Section 3 discusses the concurrent version of DPTree. Section 4 discusses the experimental results. Section 5 presents related work. Finally, Section 6 concludes the paper.

2. DIFFERENTIAL PERSISTENT TREE

[Figure 1: DPTree Architecture. Inserts/deletes/updates go to the buffer tree in DRAM, which appends to a write-optimized adaptive log in PM; a crash-consistent in-place merge moves entries into the base tree, whose coarse-versioned leaf layer resides in PM; queries consult both trees.]

As Figure 1 shows, DPTree consists of two tree structures, namely the buffer tree and the base tree. The buffer tree is DRAM-resident and is optimized for updates. We design an adaptive write-optimized logging scheme for the buffer tree for durability. With this log, a write to the buffer tree whose log entry fits in a cache line requires only one persist to PM. The base tree is read-only and employs the selective persistence scheme used in [31, 23], where the DRAM-resident part is a radix tree while the PM-resident part is a linked list of leaf nodes. Inserts/updates are implemented as upserts to the buffer tree. If the key is not in the buffer tree, it is inserted into the tree; otherwise, its value is overwritten. Deletions are implemented as upserts as well, using tombstone records: one bit of the value is reserved to indicate whether the corresponding key is deleted. A point query first searches the buffer tree. If the key is found, the query returns the latest value; otherwise, the base tree is consulted. For range queries, both trees are searched and the results are merged.

DPTree maintains a merge threshold R ∈ (0, 1) for the size ratio between the buffer tree and the base tree (e.g., 1:10). When the size ratio exceeds the threshold, a merge process starts. We discuss the merge process in detail in Section 2.3. As an optimization, we maintain a bloom filter for the buffer tree, similar to the dual-stage index architecture [32], to filter out searches for keys that are not in the buffer tree.
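The read path just described can be summarized in a short C++17 sketch. The component types, names, and the tombstone-bit convention below are illustrative assumptions, not DPTree's actual API; the trees are modeled by ordered maps purely to keep the example self-contained.

#include <cstdint>
#include <map>
#include <optional>

// Illustrative stand-ins for DPTree's components: the buffer tree and base tree
// are modeled as ordered maps, and the bloom filter is a trivially conservative one.
struct BloomFilter { bool may_contain(uint64_t) const { return true; } };
using Tree = std::map<uint64_t, uint64_t>;

constexpr uint64_t kTombstoneBit = 1ull << 63;   // one value bit marks a deletion

// Point query: consult the DRAM buffer tree first (skipped when the bloom filter
// rules the key out), then fall back to the PM-resident base tree.
std::optional<uint64_t> dptree_get(const BloomFilter& bf, const Tree& buffer,
                                   const Tree& base, uint64_t key) {
    if (bf.may_contain(key)) {
        auto it = buffer.find(key);
        if (it != buffer.end()) {
            if (it->second & kTombstoneBit) return std::nullopt;  // tombstoned upsert hides older data
            return it->second;                                    // latest value wins
        }
    }
    auto it = base.find(key);
    if (it == base.end()) return std::nullopt;
    return it->second;
}

int main() {
    Tree buffer{{1, 10}, {2, 20 | kTombstoneBit}};  // key 2 was deleted via the buffer tree
    Tree base{{2, 99}, {3, 30}};
    BloomFilter bf;
    return (dptree_get(bf, buffer, base, 1).value_or(0) == 10 &&
            !dptree_get(bf, buffer, base, 2) &&
            dptree_get(bf, buffer, base, 3).value_or(0) == 30) ? 0 : 1;
}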
2.1 Buffer Tree

The buffer tree is an in-DRAM B+-Tree whose durability is guaranteed by a PM write-ahead log. Since the buffer tree is always rebuilt on recovery, the log only stores the logically upserted key-value entries, which usually results in a smaller storage footprint than traditional redo/undo logging for page-based storage.

[Figure 2: Buffer Tree Log Layout. A persistent header PLog (first, curp, poff) points to a linked list of log pages whose 8-byte words carry per-word validity bits; a caching log page allocator maintains two freelists (freelist0/freelist1) of capacity Csize, and freed full pages are attached to the other freelist.]

Write-Optimized Adaptive Logging. We assume that the key-value entries are fixed-length. Our goal is to update the B+-Tree durably using a PM log. One naïve way to implement such a log is to maintain a persistent linked list of log records. This requires at least two persists for every append operation: one for the log record and one for the link pointer. Instead, we use an idea similar to the Mnemosyne system [27], leveraging the 8-byte failure-atomic write of modern hardware. First, a log buffer is allocated, and one bit of every 64-bit word is reserved as the validity bit. We initially set the validity bit of every word to 0. We serialize a log record (key-value entry) into multiple 64-bit words, each holding 63 bits of information and a validity bit of 1. These words are then written into the log buffer and persisted. When an entry fits in a cache line after serialization, this log append requires only one persist. When reading the log on recovery, we can detect torn writes by checking the validity bits of the log records. One flaw of this method is that it requires initialization, doubling the number of PM writes.
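As a concrete illustration of the validity-bit encoding, the sketch below packs an 8-byte key and 8-byte value into 63-bit payload chunks and stamps the most significant bit of every emitted word as the validity bit. The choice of the MSB, the little-endian bit packing, and the fixed 8B+8B record format are assumptions of this sketch, not details given by the paper.

#include <cstdint>
#include <vector>
#include <cassert>

// Mnemosyne-style torn-write detection: every 64-bit log word holds 63 payload
// bits plus one validity bit, so each word can be written with one 8B atomic store.
static constexpr uint64_t kValidBit    = 1ull << 63;
static constexpr int      kPayloadBits = 63;

// Serialize 128 payload bits (key + value) into words whose validity bit is set.
std::vector<uint64_t> encode_record(uint64_t key, uint64_t value) {
    const uint64_t payload[2] = {key, value};
    std::vector<uint64_t> words;
    uint64_t acc = 0;   // bit accumulator
    int bits = 0;       // number of payload bits currently in acc
    for (uint64_t p : payload) {
        for (int i = 0; i < 64; ++i) {
            acc |= ((p >> i) & 1ull) << bits;
            if (++bits == kPayloadBits) {            // word full: stamp the validity bit
                words.push_back(acc | kValidBit);
                acc = 0; bits = 0;
            }
        }
    }
    if (bits > 0) words.push_back(acc | kValidBit);  // last, partially filled word
    return words;  // 128 payload bits -> 3 words, each persisted with one 8B write
}

// Recovery side: a word whose validity bit is still 0 indicates a torn record.
bool decode_record(const std::vector<uint64_t>& words, uint64_t& key, uint64_t& value) {
    uint64_t out[2] = {0, 0};
    int bit = 0;
    for (uint64_t w : words) {
        if (!(w & kValidBit)) return false;          // torn write detected
        for (int i = 0; i < kPayloadBits && bit < 128; ++i, ++bit)
            out[bit / 64] |= ((w >> i) & 1ull) << (bit % 64);
    }
    key = out[0]; value = out[1];
    return bit == 128;
}

int main() {
    uint64_t k = 0xDEADBEEFCAFEBABEull, v = 0x0123456789ABCDEFull, k2, v2;
    auto words = encode_record(k, v);
    assert(words.size() == 3 && decode_record(words, k2, v2) && k2 == k && v2 == v);
    return 0;
}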
Mnemosyne proposes to make the log buffer fixed-size and to reuse it after it is full by flipping the meaning of the validity bit (e.g., 1 for invalid and 0 for valid). However, this does not work in our case because the buffer tree can grow in capacity; hence the log buffer needs to grow as well.

To solve this problem, we organize the log as a persistent list of log pages and employ a caching allocation scheme. As shown in Figure 2, the linked list includes a persistent header PLog, which contains a persistent pointer to the first page, a volatile pointer to the current page, and a volatile offset within the current page. Our log buffer allocator maintains two freelists of pages, namely freelist0 and freelist1: one holds pages whose validity bits are set to 0, the other pages whose validity bits are set to 1. When a buffer tree is created, its log chooses one of the two freelists for allocation based on a boolean-valued version number that is flipped after each merge. When the chosen freelist runs out of pages upon an allocation request, it is refilled with pages from the system allocator. After a merge for the buffer tree completes, the full pages in the chosen freelist are transferred to the other freelist for the next buffer tree. Using this allocation scheme, we can reuse most of the existing pages and pay the persistence overhead only when the cached free pages are in short supply. To keep the freelist size reasonable when the index shrinks due to deletions, we dynamically maintain the cache capacity Csize to be about the size of the buffer tree before a merge is triggered. Since the buffer tree grows or shrinks by a factor of at most R, the cache size adapts to this rate as well. To amortize the persistence overhead introduced by maintaining the freelists, we can increase the page size.
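The sketch below illustrates the two-freelist caching idea in volatile memory only; the page layout, capacity, and system-allocator fallback are simplifications, and PM mapping and persistence of the freelists are deliberately omitted.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <deque>

// Illustrative caching page allocator: freelist_[b] holds pages whose words all
// currently read validity bit b, so they can be handed to a buffer tree that
// treats b as "invalid" without re-initializing them.
struct LogPage { uint64_t words[8192]; };   // 64KB page, matching the log page size used later

class CachingPageAllocator {
public:
    LogPage* allocate(bool wanted_bit) {
        auto& list = freelist_[wanted_bit];
        if (list.empty()) {                 // cache miss: fall back to the system allocator
            LogPage* p = static_cast<LogPage*>(std::malloc(sizeof(LogPage)));
            std::memset(p, wanted_bit ? 0xFF : 0x00, sizeof(LogPage));  // pay initialization once
            return p;
        }
        LogPage* p = list.front();
        list.pop_front();
        return p;
    }
    // After a merge, the full pages carry records whose validity bit is bit_now_stored,
    // so they are attached to that freelist (the "other" one relative to where they
    // were allocated from) and serve the next buffer tree without re-initialization.
    void recycle_after_merge(bool bit_now_stored, std::deque<LogPage*>&& full_pages) {
        auto& list = freelist_[bit_now_stored];
        for (LogPage* p : full_pages) list.push_back(p);
        while (list.size() > capacity_) { std::free(list.back()); list.pop_back(); }
    }
    void set_cache_capacity(std::size_t pages) { capacity_ = pages; }  // ~ buffer-tree size before merge
private:
    std::deque<LogPage*> freelist_[2];
    std::size_t capacity_ = 1024;
};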
2.2 Base Tree

[Figure 3: Base Tree Leaf Layout. Each leaf holds two metadata sets (Meta0/Meta1), selected by the global version gv, and a pool of N key-value slots; each metadata set contains a bitmap, a next pointer, max_key, a restart count (rc), a count (cnt), N key indices (order), and one-byte hashes (fingerprints). Part of the metadata is crash-consistent; the rest is reconstructible.]

In DPTree, the base tree is read-only and updated only by merges. Therefore we optimize it for read and merge. To take advantage of the DRAM-PM architecture, we employ the selective persistence scheme in the base tree, where the internal nodes are in DRAM while the leaves are in PM. Two consecutive leaves are relatively ordered while a leaf itself may not be, and the leaf list is coarse-grained versioned for persistence.

Coarse-Grained Versioning Leaves. The CDDS B-Tree [26] proposes to version key-value entries at every write operation to achieve crash consistency. However, such fine-grained versioning incurs large overheads. Inspired by the versioning technique, we make the following observation: the before-merge and after-merge images of the base tree provide a natural versioning boundary. Based on this observation, we propose the coarse-grained versioning technique. We maintain in every leaf node two sets of metadata and a key-value slot pool, as shown in Figure 3. Each metadata set consists of a bitmap, a next pointer, the max key in the node, the number of key-value entries in the node, an order array, and an array of fingerprints. The bitmap tracks the allocation status of the key-value slots. The order array stores the ordered indices to the key-value slots, similar to wB+-Tree [9], to accelerate range scans. The key-value pool is organized by hashing, and collisions are resolved using linear probing. The fingerprints array stores one-byte hashes of the keys stored at the corresponding positions of the key-value pool; it serves to filter out unnecessary PM reads. To support deletion in the linear-probing hash table, we use two bits in the bitmap for each slot: 00 for an empty slot, 01 for an occupied slot, and 10 for a deleted slot. Hash probing continues until either a matching key or an empty slot is found. A persistent global version number gv is maintained to indicate which metadata set is active in every leaf node. gv also selects the freelist for the current buffer tree discussed in Section 2.1. A merge handles updates to a key-value entry out-of-place but mostly within a leaf: we find an empty/deleted slot within the leaf and update the inactive metadata. When the merge completes, gv is flipped. Therefore gv combined with the dual metadata distinguishes the before-image from the after-image of the leaf list with low overhead. Queries first take a snapshot of gv and traverse from the DRAM radix tree down to a specific leaf node. In the leaf node, the keys are examined with the help of the metadata of the snapshot version.
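The sketch below illustrates the in-leaf probe combining one-byte fingerprints with linear probing and the empty/occupied/deleted slot states. The concrete layout (a byte per slot instead of two bitmap bits, the hash functions, and the slot count) is a simplification of Figure 3 made for readability, not the paper's exact structure.

#include <cstddef>
#include <cstdint>

constexpr int kSlots = 256;   // leaf capacity used later in the evaluation

enum SlotState : uint8_t { kEmpty = 0, kOccupied = 1, kDeleted = 2 };

struct LeafMeta {
    uint8_t state[kSlots];        // stands in for the 2-bit-per-slot bitmap
    uint8_t fingerprint[kSlots];  // one-byte hashes of the keys in the KV pool
};

struct KV { uint64_t key, value; };

inline uint8_t fp_hash(uint64_t key)   { return static_cast<uint8_t>((key * 0x9E3779B97F4A7C15ull) >> 56); }
inline size_t  slot_hash(uint64_t key) { return static_cast<size_t>(key * 0xC6A4A7935BD1E995ull) % kSlots; }

// Linear probing stops at an empty slot or a key match; deleted slots are skipped
// but do not terminate the probe. Fingerprints filter out most PM reads of full keys.
bool leaf_lookup(const LeafMeta& m, const KV* pool, uint64_t key, uint64_t& value) {
    const uint8_t fp = fp_hash(key);
    for (size_t i = 0, s = slot_hash(key); i < kSlots; ++i, s = (s + 1) % kSlots) {
        if (m.state[s] == kEmpty) return false;                // key cannot be further along
        if (m.state[s] == kOccupied && m.fingerprint[s] == fp  // cheap in-metadata filter first
            && pool[s].key == key) {                           // only now touch the key in PM
            value = pool[s].value;
            return true;
        }
    }
    return false;
}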
Partial Persistence. One interesting aspect to note is that the leaf node is kept only partially persistent. That is, the count, order, and fingerprints arrays are modified on merge but not persisted. We choose this design based on the observations that system failures are rare and that these auxiliary data can be rebuilt from the rest of the crash-consistent data. We do not rebuild these metadata immediately at restart, as this slows down recovery. Instead, the rebuilding is completed cooperatively by queries or merges after the restart. Specifically, we maintain a persistent global counter restartCount to indicate the number of restarts. Inside every leaf node metadata set, we keep a local restart count. At restart, restartCount is incremented persistently. When a query or merge locates a leaf, the local count is checked against the global one. If they do not match, the metadata is rebuilt, and the local restart counter is updated persistently.
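The cooperative rebuild check can be sketched as below. The field names, the helper functions rebuild_volatile_meta and persist, and the global counter are assumed stand-ins mirroring the description above, not DPTree's real declarations.

#include <cstddef>
#include <cstdint>

struct Meta { uint32_t rc; uint16_t count; /* order[], fingerprints[], bitmap ... */ };
struct Leaf { Meta meta[2]; /* KV pool, next, max_key ... */ };

extern uint32_t g_restart_count;                     // persistent global counter, bumped on every restart
void rebuild_volatile_meta(Leaf& leaf, int version); // reconstruct count/order/fingerprints from the
                                                     // crash-consistent bitmap and KV pool
void persist(const void* addr, std::size_t len);     // cache-line write-back + fence, elided here

// Return the metadata of the snapshot version, lazily repairing it if this leaf
// has not been visited since the last restart.
Meta& active_meta(Leaf& leaf, int gv_snapshot) {
    Meta& m = leaf.meta[gv_snapshot];
    if (m.rc != g_restart_count) {                   // stale since the crash/restart
        rebuild_volatile_meta(leaf, gv_snapshot);
        m.rc = g_restart_count;
        persist(&m.rc, sizeof(m.rc));                // record that this leaf has been repaired
    }
    return m;
}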
2.3 Crash Consistent Merging

Since the base tree is in PM, the merge needs to be crash-consistent. A straightforward way to implement the merge is similar to LSM [22]: merge the buffer tree and the base tree into a new base tree. However, this method has a write amplification of 1/R in the worst case. To maintain fast recovery, the merge threshold R is typically chosen to be a small number, e.g., 0.1, resulting in high write amplification. The source of the write amplification is the out-of-place merge style, which is optimized for HDD/SSD but not for PM. Therefore, we propose to reduce writes by exploiting the byte-addressability of PM and merging the buffer tree into the base tree in place.

To maintain crash consistency during the merge, we exploit the coarse-grained versioning technique and follow a simple principle: never modify the data of the current version indicated by the global version number gv. As shown in Algorithm 1, our merge algorithm consists of 7 phases.

Algorithm 1: BaseTreeMerge(tree, sit, eit)
  input: [sit, eit): a pair of iterators ranging over the ordered entries from the buffer tree
  1  cv ← get version([Link]), nv ← ¬get version([Link])
  2  set bit and persist([Link], in merge)
  3  /* phase-1: upsert merge */
  4  prevLeaf ← [Link][cv]
  5  curLeaf ← [Link][cv].next, it ← sit
  6  while curLeaf != null do
  7      E_curLeaf ← non-tombstone entries in [it, eit) whose key ≤ [Link][cv].max key, moving it accordingly
  8      prevLeaf ← LeafUpsertMerge(prevLeaf, curLeaf, cv, nv, E_curLeaf)
  9      curLeaf ← [Link][cv].next
  10 end
  11 create leaves after [Link][nv].next if it != eit
  12 /* phase-2: delete merge */
  13 curLeaf ← [Link][nv].meta[nv].next, it ← sit
  14 while curLeaf != null ∧ it != eit do
  15     E_curLeaf ← non-upsert entries in [it, eit) whose key ≤ [Link][nv].max key, moving it accordingly
  16     LeafDeleteMerge(curLeaf, nv, E_curLeaf)
  17     curLeaf ← [Link][nv].next
  18 end
  19 /* phase-3: consolidation */
  20 for leaves of version nv with occupancy < T ∈ (0, 0.5], merge them with neighboring leaves
  21 /* phase-4: flush to PM */
  22 for each leaf of version nv, flush the cache lines containing newly inserted entries and the crash-consistent metadata, without using any fences
  23 /* phase-5: rebuild volatile internal nodes */
  24 /* phase-6: flip the global version number */
  25 set and persist([Link], nv)
  26 /* phase-7: garbage collection */
  27 delete leaves that can be reached through [Link][cv] but not through [Link][nv]

Algorithm 2: LeafUpsertMerge(prevLeaf, curLeaf, cv, nv, E_curLeaf)
  input:  E_curLeaf: ordered key-value entries whose key ≤ [Link][cv].max key
  output: last leaf after merge
  1  [Link][nv] ← [Link][cv]
  2  if [Link][nv].rc != restartCount then
  3      reconstruct metadata of version nv
  4      [Link][nv].rc ← restartCount
  5  freeslots ← LeafCapacity − [Link][nv].count
  6  if [Link] ≤ freeslots then
  7      upsert entries by hashing and merging using the bitmap and order array of version nv
  8      return curLeaf
  9  else
  10     NL ← create leaves filled with the entries from merging E_curLeaf with curLeaf at filling rate FR
  11     [Link][nv].next ← [Link]
  12     [Link][nv].next ← [Link][cv].next
  13     return [Link]
  14 end

Algorithm 3: LeafDeleteMerge(leaf, nv, kvs)
  input: kvs: ordered key-value entries whose key ≤ [Link][nv].max key
  1  i ← 0, j ← 0, leafKeyCount ← [Link][nv].count
  2  while i < leafKeyCount ∧ j < [Link] do
  3      leafKeyIdx ← [Link][nv].order[i]
  4      leafKey ← [Link][leafKeyIdx].key
  5      if leafKey < kvs[j].key then
  6          i++
  7      else if leafKey = kvs[j].key then
  8          [Link][nv].bitmap[leafKeyIdx] ← deleted
  9      else
  10         j++
  11     end
  12 end
  13 reconstruct meta[nv].order and meta[nv].count
Given a pair of iterators ranging over the ordered key-value entries from the buffer tree, our merge algorithm starts by reading the current version cv from the global version number gv and computing the next version nv (line 1). We then persistently set the in merge bit of gv to indicate the presence of a merge (line 2), which is used in recovery. The algorithm then executes the following 7 phases.

Upsert Merge Phase. We enumerate the leaves of version cv (lines 4-11) to perform the merge. For every leaf, denoted curLeaf, we obtain the entries to be merged into curLeaf by extracting from the input sequence the entries E_curLeaf whose keys are no greater than the max key of curLeaf (line 7). Most of the heavy lifting of the merge between curLeaf and E_curLeaf is handled by LeafUpsertMerge (line 8). As shown in Algorithm 2, it first makes a copy of meta[cv] into meta[nv]. Then it rebuilds the metadata of version nv if the local restart counter does not match the global one (lines 2-3) and updates the local restart counter. If there are enough free slots for the incoming entries, they are merged using the order array and bitmap in meta[nv] (lines 6-8). An illustration of such a case is shown in Figure 5. Otherwise, a node split is needed. Figure 4 shows the before and after image of a node split. We first create as many new leaves as needed to hold the entries obtained by merging curLeaf with E_curLeaf. The new leaves are then linked using the metadata of version nv in prevLeaf. Since we only modify the metadata of the inactive version nv and the unused key-value slots in each leaf, even if the system crashes during the split, we are still able to recover the data structure via the links of version cv. In addition, we do not fill the new leaves completely, as this might result in excessive splits in future merges. The filling rate is denoted by a parameter FR ∈ (0.5, 1].

Delete Merge Phase. Deletion is similar to upsert. The difference is that deletion only needs to modify the bitmap. As shown in lines 13-18 of Algorithm 1, E_curLeaf is extracted from the merge sequence for each leaf node and LeafDeleteMerge is called. In LeafDeleteMerge (Algorithm 3), the deleted entries are first found in a manner similar to merge sort. Then the bits of the deleted entries in the bitmap are marked (line 8). Lastly, the order array and count of version nv are reconstructed using the updated bitmap and entry pool.
[Figure 4: node split during upsert merge — new leaves holding E_curLeaf (K1, K2, K3, ...) are linked through prevLeaf's metadata of version nv; the old links and data of version cv are preserved.]

[Figure 5: Before and After Image of Leaf Upsert Merge (No Split) — (a) node before merge, (b) node after merge: meta M0 of version cv is copied into M1 of version nv, the entries of E_curLeaf (4, 5, 7) are merged into unused slots guided by the bitmap and hashing, and M1 (max_key, count, rc, order, bitmap, fingerprints, next) is updated accordingly; the data of the old version remains intact.]
Consolidation Phase. To maintain a reasonable node occupancy, leaves of version nv whose occupancy is below a consolidation threshold CT ∈ (0, 0.5] are merged with adjacent nodes. They are merged into the next leaf if it has enough free slots. If not, entries are borrowed from the next leaf until the two leaves have about the same occupancy.

Flushing Phase. As shown in lines 22-25 of Algorithm 1, we use flush instructions on every leaf of version nv to flush the cache lines that contain newly inserted entries. We skip cache lines that contain only newly deleted entries, because a deletion requires only the bitmap to be flushed. Then the crash-consistent metadata are flushed to PM. Notice that we do not use fences to order these flushes: the single mfence in the Global Version Flip phase guarantees that these writes are persisted to PM, reducing the fencing overhead significantly.
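The fence-batching pattern can be sketched as below: dirty leaf cache lines are written back without per-line fences, and a single store fence ordered before the persist of the flipped version makes them all durable first. This is a minimal x86 sketch assuming clwb support (compile with -mclwb); it uses sfence where the text mentions mfence, and it is not the paper's exact code.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// Flush phase: queue write-backs of the dirty range, but do not order them yet.
inline void flush_range_nofence(const void* addr, std::size_t len) {
    auto p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
    for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCacheLine)
        _mm_clwb(reinterpret_cast<void*>(p));   // weakly ordered write-back
}

// Global version flip: one fence drains all earlier write-backs, then the 1-byte
// (failure-atomic) version store is flushed and fenced on its own.
inline void flip_global_version(uint8_t* gv_pm, uint8_t next_version) {
    _mm_sfence();          // all previously clwb'ed leaf data is ordered before the flip
    *gv_pm = next_version;
    _mm_clwb(gv_pm);
    _mm_sfence();          // the flip itself is durable
}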
Rebuilding Internal Nodes. The internal nodes are then rebuilt from the leaves of version nv using the max key and persistent address of each leaf. As our leaf nodes are updated infrequently, maintaining the max key is very cheap. Moreover, keeping the max key also speeds up recovery, as the amount of data read from PM is drastically reduced.

Global Version Flip. Once all the above phases have finished, the global version number gv is flipped and persisted to PM. This officially concludes the merge and, in effect, frees up the space occupied by the old entries for the next merge.

Garbage Collection. During the above phases, garbage leaves might be generated and therefore require reclamation. For example, a leaf split renders the old leaf obsolete. The identification of garbage nodes is simple. Let Lcv denote the set of leaves reachable through the links of version cv; similarly, Lnv is obtained through the links of version nv. The garbage leaves are then simply Lcv − Lnv. Note that garbage collection is placed after the version flip, because only after the flipped global version has reached PM can we safely reclaim space. To handle crashes during reclamation, we traverse the leaves of version cv backwards and delete every node garbageLeaf that is in Lcv − Lnv using the address stored in the predecessor of garbageLeaf. In this way, we can freely re-execute the reclamation during recovery.

2.4 PM Management & Recovery

PM Allocation. The non-volatility of PM introduces challenges for memory allocation. For example, the system could crash between the moment a region is allocated from the PM allocator and the moment the address of that region is persistently stored in the user's data structure, resulting in a PM leak. To avoid this, we assume that the system is equipped with a PM allocator offering the following APIs [24]: reserve, activate, and free. The allocator first reserves a region of PM per user request via the reserve API. The user then provides a PM location that persistently receives the address of the reserved memory region through the activate API. Only after activate returns can we consider the allocation complete. On deallocation, the user provides the same PM location to the free API. The allocator then failure-atomically frees the memory region and sets the PM location to a pre-defined value (e.g., null). This ensures that user programs can avoid double-freeing the memory region by checking the persistent pointer against the pre-defined value.
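The leak-free allocation pattern can be sketched as follows. The three functions pm_reserve/pm_activate/pm_free are hypothetical stand-ins for a reserve/activate/free allocator in the style of [24]; they are not the real library signatures, and the leaf type is a placeholder.

#include <cstddef>

// Hypothetical reserve/activate/free allocator interface (assumed, not the real API).
void* pm_reserve(std::size_t size);                  // step 1: reserve a region, not yet owned
void  pm_activate(void* region, void** pm_link);     // step 2: atomically publish its address into *pm_link
void  pm_free(void** pm_link);                       // failure-atomically free and reset *pm_link to null

struct LeafNode { unsigned char data[4096]; };       // placeholder for a PM leaf

// Allocate a new leaf and durably link it from a persistent pointer slot.
// If we crash before pm_activate completes, the reservation is not yet owned by the
// data structure and the allocator can reclaim it, so no PM is leaked.
LeafNode* alloc_leaf(LeafNode** persistent_slot) {
    void* region = pm_reserve(sizeof(LeafNode));
    pm_activate(region, reinterpret_cast<void**>(persistent_slot));
    return *persistent_slot;
}

// Freeing is crash-idempotent: the allocator resets the slot to the predefined value,
// so re-running this after a crash cannot double-free.
void free_leaf(LeafNode** persistent_slot) {
    if (*persistent_slot != nullptr)
        pm_free(reinterpret_cast<void**>(persistent_slot));
}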
Recovery. On system restart, the index needs to recover to a consistent state. The recovery procedure, shown in Algorithm 4, first determines whether the crash happened during a merge by examining the in merge bit of the global version number [Link]. If the bit is set, leaf nodes created during the interrupted merge require garbage collection. To be able to reclaim them, we persistently store the address of every new leaf in a persistent linked list [Link] before linking it into the leaf list during the merge. Upon a successful merge, the list [Link] itself is destroyed persistently using the free API of the system allocator. When an interrupted merge is detected, we reclaim those leaves by traversing [Link] and deleting the pending leaves using the addresses stored in the list nodes, again through the free API of the system allocator. The algorithm then continues the GC process if needed; this includes re-running the garbage reclamation and persistently destroying the [Link] itself. The buffer tree is then rebuilt from PLog. As the last step, restartCount is incremented.

Algorithm 4: Recover(tree)
  1  in merge ← extract in merge bit([Link])
  2  if in merge = 1 then
  3      reclaim new leaves created during the merge
  4      clear bit and persist([Link], in merge)
  5      finish garbage collection
  6  replay PLog sequentially to reconstruct the buffer tree
  7  inc and persist(restartCount)

2.5 Managing Complexity

It might seem complicated to equip each index structure with a standalone log, although such an approach is used in several works [9, 23]. However, the standalone log is essential for providing durability with low overhead. One might consider reducing the complexity by keeping a single log for multiple index instances. However, multiple instances might interfere with each other and trigger merges unnecessarily, thereby increasing the complexity. Instead, one can hide the complexity via encapsulation, since DPTree provides the same interfaces as other indices, which makes integration with real systems easier.

3. CONCURRENT DPTREE

Multi-core architectures are now pervasive. To unleash the full potential of multi-core processors, an efficient design of concurrent operations on persistent index structures is crucial. Therefore, we address the design of concurrent DPTree.

[Figure 6: Concurrent DPTree Architecture. Inserts/deletes/updates go to the Front CPBT in DRAM, backed by a concurrent partitioned persistent log in PM; a full Front CPBT becomes the read-only Middle CPBT, which is merged in parallel into the base tree's coarse-versioned leaf layer; queries consult the Front CPBT, the Middle CPBT, and the base tree.]

The concurrency scheme for DPTree can be described in three parts: the concurrent persistent buffer tree (CPBT), parallel merge, and the synchronization between tree components. Figure 6 shows the architecture of concurrent DPTree. Concurrent writes are handled by the Front CPBT. When the Front CPBT is full, it becomes read-only and is renamed the Middle CPBT, and a new CPBT is created as the Front CPBT to serve new writes. The base tree then merges with the Middle CPBT in parallel in the background. Reads consult, in sequence, the Front CPBT, the Middle CPBT (if any), and finally the base tree. A read finishes as soon as a matching key is found in any of the three components.

3.1 Concurrent Persistent Buffer Tree

The concurrent persistent buffer tree consists of an in-DRAM concurrent B+-Tree and a PM log. Since there are many studies on concurrent DRAM indices, such as Masstree [19], BwTree [17] and ART [16], we employ OLC (Optimistic Lock Coupling) [16] in our B+-Tree as the synchronization mechanism for its simplicity and efficiency. The main challenge, however, is to make the log persistent, concurrent, and scalable for recovery.

Concurrent Logging. We solve this by hash partitioning the log on the key into N smaller partitions. Each partition is implemented as a concurrent persistent linked list of log pages, and each partition is equipped with a caching page allocator as well. As can be seen from Figure 7, the log has a header table of size N, with each slot storing the address of the tail page of its list. The next log record is written into the tail page. Each log page is fixed-size, consisting of metadata and a storage area. The off field in the metadata is the writing position of the next log record within the storage area. The prev pointer points to the previous log page; a page with a null prev field is the head of the linked list. Therefore, the recovery procedure starts from this head page and walks back to the tail page. To reduce PM writes, off is only modified but not persisted. This is tolerable because, upon recovery, valid records can be found by examining the validity bits in each log record.

[Figure 7: Concurrent Buffer Tree Log Layout. A log header table of N partitions stores head/tail pointers to per-partition linked lists of log pages; each page carries an off cursor, a prev pointer, and 8-byte words with validity bits, and the modified-but-not-flushed tail page may be only partially valid.]

The append algorithm for the log is shown in Algorithm 5. To append records concurrently, a writer finds the tail page of a partition through the header table and uses an atomic FAA (fetch-and-add) instruction on the off field to claim the writing position and space for its record. If the claimed offset is within the capacity of the log page, the record is written and persisted (line 6). Otherwise, a new log page is allocated to become the new tail page of the partition. A partition lock protects this allocation. The caching allocator is modified to support the two-step (i.e., reservation + activation) allocation scheme. Though we use locks here, they turn out not to be a bottleneck in our experiments, because the hot code path is the log record writing, which is already lock-free. Finally, we update the size of the buffer tree at the end of a page allocation (line 13). This, in effect, makes the buffer tree size approximate and prevents the shared size counter from becoming a bottleneck.

Algorithm 5: ConcurrentLogAppend(tree, r)
  1  p ← hash([Link]) % N
  2  restart: tailp ← [Link] headers[p]
  3  goto new page if tailp is null
  4  off ← FetchAndAdd([Link], sizeof(r))
  5  if off + sizeof(r) ≤ StorageAreaCap then
  6      write and persist r into [Link] at offset off
  7  else
  8      new page: lock partition p
  9      newp ← [Link]([Link])
  10     [Link] ← 0, set and persist(&[Link], tailp)
  11     [Link](&[Link] headers[p], newp)
  12     unlock partition p
  13     atomically increment the size of the buffer tree by the number of log records per page
  14     goto restart
  end
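The lock-free fast path and the locked page switch can be sketched as below. The page layout, the partition count, and the helpers allocate_log_page and persist_record are illustrative assumptions; persistence and the size accounting of step 13 are elided.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>

constexpr std::size_t kPageCap    = 1 << 16;  // storage-area capacity per log page
constexpr std::size_t kPartitions = 64;

struct LogPage {
    std::atomic<uint32_t> off{0};             // volatile write cursor (not persisted)
    LogPage* prev = nullptr;                  // persistent back link to the previous page
    char storage[kPageCap];
};

struct Partition {
    std::atomic<LogPage*> tail{nullptr};
    std::mutex switch_lock;                   // taken only when a new tail page is needed
};

Partition g_parts[kPartitions];

LogPage* allocate_log_page(LogPage* prev);                          // caching allocator, elided
void persist_record(void* dst, const void* src, std::size_t len);   // copy + flush, elided

void log_append(uint64_t key, const void* rec, std::size_t len) {
    Partition& p = g_parts[std::hash<uint64_t>{}(key) % kPartitions];
    for (;;) {
        LogPage* tail = p.tail.load(std::memory_order_acquire);
        if (tail != nullptr) {
            // Fast path: claim space with one fetch-and-add, no lock.
            uint32_t off = tail->off.fetch_add(static_cast<uint32_t>(len));
            if (off + len <= kPageCap) {
                persist_record(tail->storage + off, rec, len);
                return;
            }
            // Overshot offsets are simply abandoned; recovery relies on validity bits.
        }
        std::lock_guard<std::mutex> g(p.switch_lock);               // slow path: install a new tail page
        if (p.tail.load(std::memory_order_acquire) == tail)         // nobody else switched it yet
            p.tail.store(allocate_log_page(tail), std::memory_order_release);
        // retry: the freshly installed page serves the append
    }
}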
Concurrent Modification. We now describe the CPBT upsert algorithm (insert, delete, and update are all implemented as upserts). To upsert a key X, a log record is written before the B+-Tree is modified. However, the order of the records of X in the log must match the order of the modifications of X applied to the B+-Tree. This implies that the log append and the tree modification of an upsert must form one atomic operation. Consider the case where one thread is inserting key X and another is deleting X. If these two steps were not atomic, it would be possible that in the log the deletion of X is ordered before the insertion of X, while in the B+-Tree the insertion is ordered before the deletion. On crash recovery, the reconstructed B+-Tree would then differ from the one before the crash. To prevent this, we utilize the write lock in the leaf node of the Optimistic Lock Coupling B+-Tree. On every upsert, we first acquire the lock on the B+-Tree leaf, then perform the concurrent logging, then modify the leaf, and finally update the bloom filter. This correctly serializes conflicting upserts of the same key.
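The ordering can be condensed into the following sketch. LeafNode, locate_leaf, log_append, and bloom_insert are assumed interfaces standing in for the OLC B+-Tree leaf, the partitioned PM log, and the bloom filter; they are not DPTree's real declarations, and a plain mutex stands in for the OLC write lock.

#include <cstdint>
#include <mutex>

struct LeafNode {
    std::mutex write_lock;                        // stands in for the OLC leaf write lock
    void upsert(uint64_t key, uint64_t value);
};

void log_append(uint64_t key, uint64_t value);    // durable PM log append (Section 3.1)
void bloom_insert(uint64_t key);                  // buffer-tree bloom filter
LeafNode& locate_leaf(uint64_t key);              // OLC traversal to the target leaf

void cpbt_upsert(uint64_t key, uint64_t value) {
    LeafNode& leaf = locate_leaf(key);
    std::lock_guard<std::mutex> g(leaf.write_lock);  // serializes conflicting upserts of the same key
    log_append(key, value);                          // 1) durable log record first
    leaf.upsert(key, value);                         // 2) then the volatile tree, in the same order
    bloom_insert(key);                               // 3) finally the filter
}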
3.2 Synchronizing Tree Components

Concurrent DPTree has multiple components that need to be synchronized. For example, after a tree merge, GC should be performed only when no readers hold references to the Middle CPBT. This essentially requires tracking the number of active readers and writers. One naïve way is to use globally shared counters [11]. However, previous studies [8, 19] have shown that frequent writes to shared cache lines kill scalability due to high cache-invalidation traffic. Instead, we track readers and writers in a decentralized way, similar to RCU [20]. We maintain four thread-local synchronization words: WFW, WFR, WMR, and WBR. WFR/WFW stores the address of the Front CPBT the current reader/writer thread is accessing. WMR stores the address of the Middle CPBT the current reader thread is accessing. WBR stores the root of the radix tree in the base tree the current reader thread is accessing. One bit of each word is reserved as the active bit, indicating the presence of a reader/writer; the remaining bits store a pointer value, acting as a reference. All threads register the addresses of their words with a synchronization manager (SM) at startup or on first access. We also maintain three global pointers, GPF, GPM, and GPB, pointing to the current Front CPBT, the Middle CPBT, and the root of the radix tree in the base tree, respectively. We assume that accesses to shared word-sized values are atomic with acquire/release semantics (e.g., on x86).

Algorithm 6: ConcurrentSearch(key)
  1  seq-cstly set WFR.active
  2  WFR.ptr ← GPF
  3  search in WFR.ptr
  4  WFR ← 0
  5  return value if found
  6  seq-cstly set WMR.active
  7  WMR.ptr ← GPM
  8  search in WMR.ptr if not null
  9  WMR ← 0
  10 return value if found
  11 seq-cstly set WBR.active
  12 WBR.ptr ← GPB
  13 search in WBR.ptr
  14 WBR ← 0
  15 return value if found

Tracking Readers. The steps of a concurrent search are shown in Algorithm 6. It starts by sequentially consistently (abbreviated as seq-cstly, e.g., using mfence on x86) setting the active bit of WFR, declaring the entrance of a reader. Then GPF is loaded into the other bits of WFR, acting as a reference to the data structure. The search is then performed on the Front CPBT. Finally, WFR is cleared, declaring the exit of the reader. Access to the Middle CPBT and the base tree is guarded similarly.

Synchronization for Metadata Rebuild. The concurrent search in the base tree is similar to the single-threaded search because we never modify the data of the previous version during a merge. However, after a restart, the metadata in a leaf might require a rebuild, which needs synchronization. This is achieved by using one bit of the local restart counter of each leaf as a lock bit. During a search or merge, the local restart counter is checked against the global one. If they do not match, the lock is acquired and the metadata is rebuilt. To prevent persistent deadlocks, we clear the lock bits on recovery.

Algorithm 7: ConcurrentUpsert(key, value)
  1  seq-cstly set WFW.active
  2  WFW.ptr ← GPF
  3  upsert (key, value) into WFW.ptr
  4  FrontCPBTSize ← WFW.[Link]
  5  WFW ← 0
  6  if FrontCPBTSize ≥ MergeThresh then
  7      spin while MergeState = premerge
  8      if CAS(MergeState, nomerge, premerge) then
  9          F′ ← create a new CPBT
  10         F ← GPF
  11         seq-cstly set GPM with F
  12         seq-cstly set GPF with F′
  13         SpinUntilNoRef(TYPE_WFW, F)
  14         MergeThresh ← GPB.size * R
  15         seq-cstly set MergeState with merging state
  16         spawn a new thread to run the following code:
  17             ParallelTreeMerge(GPM, GPB)
  18             MP ← GPM
  19             GPM ← 0
  20             SpinUntilNoRef(TYPE_WFR, MP)
  21             SpinUntilNoRef(TYPE_WMR, MP)
  22             delete MP
  23             seq-cstly set MergeState with nomerge state

Algorithm 8: SpinUntilNoRef(WType, ptr)
  1  while True do
  2      Wait ← false
  3      for each thread t registered in SM do
  4          W ← load sync word of t of type WType
  5          if [Link] ∧ ([Link] = 0 ∨ [Link] = ptr) then
  6              Wait ← true
  7      end
  8      break if Wait = false
  9  end
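The decentralized tracking can be sketched as follows. The word packing (active bit in bit 0), the registration vector, and the function names are assumptions made for the sketch; Algorithms 6 and 8 above remain the authoritative description.

#include <atomic>
#include <cstdint>
#include <vector>

// Each thread publishes an {active bit, pointer} word with sequentially consistent
// stores; a reclaimer spins until no published word can still reference the object.
struct SyncWord {
    std::atomic<uint64_t> word{0};               // bit 0 = active, remaining bits = pointer
    static uint64_t pack(const void* p, bool active) {
        return (reinterpret_cast<uint64_t>(p) & ~1ull) | (active ? 1ull : 0ull);
    }
};

std::vector<SyncWord*> g_registered_words;       // filled by the synchronization manager

// Reader/writer side: announce intent first, then take the reference.  The seq-cst
// stores keep the load of the global pointer from being reordered before the
// active-bit store, so the reclaimer cannot miss this reference.
inline const void* acquire_ref(SyncWord& w, const std::atomic<const void*>& global_ptr) {
    w.word.store(SyncWord::pack(nullptr, true), std::memory_order_seq_cst);  // active, ptr not yet known
    const void* p = global_ptr.load(std::memory_order_acquire);
    w.word.store(SyncWord::pack(p, true), std::memory_order_seq_cst);
    return p;
}
inline void release_ref(SyncWord& w) { w.word.store(0, std::memory_order_release); }

// Reclaimer side: an object may still be referenced if some thread is active and its
// pointer field equals the object or is still null (the race window of acquire_ref).
void spin_until_no_ref(const void* obj) {
    for (;;) {
        bool wait = false;
        for (SyncWord* w : g_registered_words) {
            uint64_t v = w->word.load(std::memory_order_acquire);
            const void* p = reinterpret_cast<const void*>(v & ~1ull);
            if ((v & 1ull) && (p == nullptr || p == obj)) { wait = true; break; }
        }
        if (!wait) return;
    }
}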
Tracking Writers & Merge Initiation. The steps of a concurrent upsert are shown in Algorithm 7. It employs a similar guarding mechanism during the write to the Front CPBT. After the write, if the Front CPBT is full, a merge task is scheduled by exactly one writer. This is achieved by atomically manipulating the MergeState integer, which has three states that transition in cyclic order: nomerge, premerge, and merging. To ensure there is at most one ongoing merge, threads spin until MergeState is not premerge before CAS'ing it to premerge. The winner gets to initiate the merge task. It first sets GPM to the full Front CPBT F, allowing new readers to perform reads on this temporary buffer tree. Then GPF is set to a new CPBT F′. Notice that there might be a short period of time where GPF = GPM (after line 11 and before line 12). In such a case, duplicate reads are possible, but the result is still correct.

After GPF is updated (line 12), there may still be writers accessing the CPBT indicated by GPM. We must wait for these writers to finish before GPM is used for the merge (line 13). This is done by scanning the WFW of each thread and spinning as long as there are possible writers left: that is, there exists a WFW whose active bit is on and whose pointer value equals either null or GPM. The test of the active bit excludes non-writer threads. The test for null handles the race condition where a writer thread has loaded the value of GPF but has not yet stored it into WFW.ptr. Notice that the sequentially consistent stores to the thread-local words are crucial, as they prevent the loads of the global pointers from being reordered before the stores to the active bits, which could otherwise result in missed references during the checks. The last condition catches the writers currently accessing GPM. Since this synchronization mechanism is used in multiple places, we abstract it into a general procedure SpinUntilNoRef, shown in Algorithm 8, which takes as input an enum (valued WFW, WFR, WMR, or WBR) and a pointer value. We then compute the next merge threshold (line 14), and MergeState is changed to the merging state (line 15). Finally, a background thread is spawned to execute the merge task.

In the merge worker, ParallelTreeMerge is called to perform the real merge (see the next section). When the merge completes, the Middle CPBT should be destroyed to accelerate reads. Similarly, there may be readers still accessing GPM. As shown in lines 17-23 of Algorithm 7, after the merge we first set GPM to null atomically, ensuring new readers go directly to the base tree. Then we use SpinUntilNoRef on the WFR and WMR of each thread to wait for the readers to exit before the CPBT indicated by GPM is destroyed. Notice that we scan WFR as well because, though unlikely, a reader might already be reading the Front CPBT before it becomes the Middle CPBT and finish its read after the merge has completed. Lastly, the old Middle CPBT is destroyed and MergeState is updated.

3.3 Parallel Tree Merging

To reduce the blocking time introduced by the base tree merge, parallel merging is essential. Luckily, the merge in our case is easy to parallelize. As shown in Algorithm 9, we divide the leaves of the base tree into partitions, denoted P1...PM. To be more specific, each partition Pi contains a constant PSize number of consecutive leaves of the base tree, delimited by two endpoint leaves Pi.BLS and Pi.BLE. We also denote by minkey(L) and maxkey(L) the smallest and greatest key in leaf L. For each partition Pi (1 ≤ i ≤ M), we find in the Middle CPBT the first key Pi.MKS ≥ minkey(Pi.BLS). Similarly, we find in the Middle CPBT the first key Pi.MKE > maxkey(Pi.BLE). The key range [Pi.MKS, Pi.MKE) of the Middle CPBT is to be merged into the leaves [Pi.BLS, Pi.BLE] of the base tree. Since these partitions are non-overlapping, they can be merged in parallel without any synchronization. The merge partitions are then scheduled to a pool of worker threads via a queue to balance the load on each worker. Moreover, PSize is set large enough to amortize the scheduling overhead but small enough to prevent a single merge task from becoming a bottleneck for the entire merge. The specifics of the merge are similar to Algorithm 1 with small modifications. The three phases UpsertMerge, DeleteMerge, and Consolidation are performed in parallel for all merge partitions (lines 3-6). After these phases are done and synchronized, there may be leaves newly created or deleted in some partitions; therefore, we reconnect the consecutive partitions through the next pointers (line 8). The Flushing Phase is also executed in parallel (lines 9-12), along with the rebuilding of the internal nodes. Similarly, we perform the version flip and GC as the last step.

Algorithm 9: ParallelTreeMerge(MT, BT)
  1  P1..PM ← partition BT and MT into M independent merge partitions
  2  /* pull-based task allocation */
  3  for each merge worker thread in the system do
  4      p ← get next to-be-merged partition in P1..PM
  5      perform the UpsertMerge, DeleteMerge and Consolidation phases on p
  6  end
  7  wait for all partitions to be processed
  8  reconnect consecutive partitions in P
  9  for each merge worker thread in the system do
  10     p ← get next to-be-flushed partition in P1..PM
  11     perform the Flush phase on p
  12 end
  13 rebuild the radix tree and update GPB atomically
  14 wait for all partitions to finish flushing and for any readers accessing the old radix tree to exit, using SpinUntilNoRef
  15 perform the global version flip
  16 perform garbage collection with one thread

3.4 Optimizations for Word-Sized Value

When the value of a key-value entry is word-sized and there is currently no merge (i.e., MergeState = nomerge), we exploit the 8B failure-atomic write to implement update/delete for keys already in the base tree without writing to the PM log. Assuming the key being updated/deleted is X, we first lock the leaf of the buffer tree where X would reside. Then we perform a search on the base tree for X. If X is found, its value is updated failure-atomically. For deletion, the tombstone bit is set, and X will be removed at the next merge. Otherwise, the update/delete operation reports that X does not exist. The update/delete operation follows the same guarding mechanism used previously. Specifically, it first obtains the reference to the Front CPBT in WFW before checking MergeState. In this way, if MergeState = nomerge, we are sure that the merge between the Front CPBT and the base tree will not be initiated (because we hold a reference) before the update/delete completes, avoiding lost updates.
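The fast path's durable 8-byte store can be sketched as below, assuming x86, clwb support, and an 8-byte-aligned value slot that may legally be accessed atomically (a production version would use the platform's persistent-atomic primitive rather than the cast shown here).

#include <immintrin.h>
#include <atomic>
#include <cstdint>

constexpr uint64_t kTombstoneBit = 1ull << 63;   // reserved value bit (Section 2)

// Update a word-sized value already in the base tree with one failure-atomic store.
inline void atomic_update_pm(uint64_t* pm_slot, uint64_t new_value) {
    reinterpret_cast<std::atomic<uint64_t>*>(pm_slot)
        ->store(new_value, std::memory_order_release);   // single 8B store is failure-atomic
    _mm_clwb(pm_slot);
    _mm_sfence();                                        // make that one store durable
}

// Delete: set the tombstone bit; the entry is physically removed at the next merge.
inline void atomic_delete_pm(uint64_t* pm_slot) {
    uint64_t v = reinterpret_cast<std::atomic<uint64_t>*>(pm_slot)
                     ->load(std::memory_order_acquire);
    atomic_update_pm(pm_slot, v | kTombstoneBit);
}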
[Figure 8: Single-Threaded Micro-benchmark — (a) Insert and (b) Lookup over data size (25M-100M), (c) Scan over selectivity (0.001%-0.1%), and (d) Merge Policy (in-place vs. out-of-place flushes per entry) over entry size (16B-64B); metrics include latency (us), throughput (MOps), and flushes per entry; indices compared: DPTree, FASTFAIR, FPTree, WORT.]
4. EVALUATION

In this section, we evaluate DPTree against other state-of-the-art persistent index structures, including FASTFAIR [12], FPTree [23], and WORT [15], on Intel Optane DC Persistent Memory Modules.

4.1 Implementation & Setup

We implemented DPTree and its concurrent version in C++11. We utilize an open-source OLC B+-Tree in the buffer tree. For FASTFAIR and WORT, we use their public implementations. Since FPTree is not open-sourced, we implemented it as faithfully as possible in C++11. For PM allocation, we use nvm_malloc, which supports the reserve-activate scheme. All source code is compiled with g++ 6.4 with -O3. We ran the experiments on a Linux system (kernel 4.9) with two Intel Xeon Platinum 8263C CPUs @ 2.50GHz (26 physical cores) with a 35MB last-level cache, 376GB DRAM, and 512GB PM. The Optane DIMMs are configured in App Direct mode and exposed through the ext4-DAX file system. To avoid NUMA effects, we bind all threads to one socket.

4.2 Micro-benchmark

In this experiment, we evaluate the performance of the different index structures using a micro-benchmark. The keys are 64-bit random integers. For DPTree, we set the capacity of a leaf to 256 entries, the merge consolidation threshold CT to 0.34, the merge threshold R to 0.1, the filling rate FR of base tree leaves to 0.7, and the log page size to 64KB. For FASTFAIR

DPTree/FPTree/FASTFAIR show very low allocation overhead in Table 1 due to their page-based allocation.

Impact of Flush and Fence. To measure the performance impact of flushes and fences, we first performed 50M insertions without flushes and fences. Then we gradually added back these instructions. We notice a 1.59/1.34/1.83/1.45x latency increase for DPTree/FASTFAIR/FPTree/WORT, respectively, when using only flush instructions. When both flush and fence instructions are added, we see an additional 1.10/1.12/1.10/1.13x increase. These results indicate that the impact of flushes is more significant than that of fences, and both should be minimized.

Table 1: PM Flush Stats per op of the 50M Insert Workload
Index       Sum    Log    Alloc   Node   Swap   Fence
DPTree      2.26   1.1    0.06    1.1    0      1.06
WORT        6.8    0      3.9     2.9    0      6.5
FASTFAIR    4.35   0      0.15    4.2    0      4
FPTree      3.64   0.05   0.07    3.51   0      3.6
2-Tier-BM   8.5    4      0.004   0      4.5    2

Point Lookup. We then perform random key searches, varying the data size from 25M to 100M. As shown in Figure 8b, DPTree is up to 1.82x/1.85x/2.19x faster than FASTFAIR, FPTree, and WORT. This is attributed to the efficient lookup inside the base tree leaf and to the bloom filter, which eliminates most of the searches (95%) for keys not in the buffer tree. For PM-only indices, such as FASTFAIR and WORT, the cache misses to PM take up more overhead as the data size grows. Thanks to the buffer tree and the inner nodes of the base tree being in DRAM, DPTree is able to complete a search with mostly 2 accesses to PM.
DPTree is up to 3.0/10.98x faster than FPTree/WORT. This is because FPTree requires sorting, whereas DPTree avoids this cost by using the order array during scans. WORT performs the worst because it stores payloads randomly in PM, causing low cache utilization. In contrast, DPTree stores payloads inside a page for better locality. Compared to FASTFAIR, which is inherently good at scans, DPTree is only up to 1.1x slower due to the indirection through the order array.

Space Consumption. We then measure the memory utilization of all indices. After 50M insertions, the results are 1.7/1.26/1.28/6.02 GB for DPTree/FASTFAIR/FPTree/WORT, respectively. DPTree uses about 34% more space than FASTFAIR and FPTree because it maintains two sets of metadata. However, the space consumption mainly comes from the fingerprints and order array, which accelerate search operations; crash consistency is still maintained without them. After removing these metadata, DPTree requires only about 5% more space than FPTree. On the other hand, WORT takes much more space than the other indices due to its small radix and its external allocation of payloads.

In-place Merge vs. Out-of-place Merge. To show that coarse-grained versioning indeed reduces PM writes, we compare the in-place merge algorithm, which is enabled by coarse-grained versioning, with an out-of-place merge. We replace the leaf list in the original base tree with a contiguous PM region where the entries are stored compactly in sorted order. We insert 5M random key-value entries and vary the entry size from 16B to 64B. The number of flushes to PM per entry is shown in Figure 8d. We notice that the out-of-place merge incurs up to 5.8x more PM flushes as the entry size increases. When the entry size decreases, the out-of-place merge incurs fewer flushes, as a cache line then contains more entries; however, it still requires at least 2.1x more flushes. Therefore, coarse-grained versioning is effective in reducing PM writes compared to traditional out-of-place merging.

Different Leaf Search Schemes. In this experiment, we compare our hybrid linear probing with fingerprints against two other search schemes: linear-probing-only and fingerprints-only (FPTree style). We first populate DPTree with 50M entries varying from 16B to 64B and perform 50M successful/unsuccessful point lookups. The average lookup latency is shown in Figure 10. For successful lookups, linear probing with fingerprints has the lowest latency as the entry size increases. The fingerprints-only probing scheme performs the worst in this case, as it needs to check half of the fingerprints on average. The linear-probing-only scheme performs better than the fingerprints-only scheme because the correct entry is near the start of the probing position most of the time; however, its latency degrades as the entry size increases because of the increased number of cache lines probed. For unsuccessful lookups, the schemes with fingerprints stand out because they mostly check only fingerprints, so they show only a slight increase in latency as the entry size grows. The linear probing with fingerprints scheme reduces latency even further, as it requires far fewer fingerprint checks.

4.3 PM-Optimized Buffer Managers

Traditional buffer managers utilize DRAM as a faster cache in front of a slower block-oriented storage device to improve performance. There have been works [25, 7, 14] on optimizing the buffer manager for persistent memory. The 3-Tier buffer management approach proposed by Renen et al. [25] incorporates PM as an additional persistent layer between DRAM and SSD/HDD. It also introduces several key optimizations enabled by persistent memory. As DPTree contains DRAM components too, we are interested in comparing DPTree with this approach for indexing purposes. For a fair comparison, we implemented a BTree using a 2-Tier (DRAM and PM) buffer manager design with key optimizations including cache-line-grained pages, mini pages, and pointer swizzling. To guarantee the crash consistency of the BTree, we implement textbook-style physiological redo/undo logging on PM. We configure the 2-Tier-BM (buffer manager) to use 16KB pages and employ a hashing-based leaf. We perform 50M insertions followed by 50M lookups of random 8B keys (the worst case for both indices). The results are shown in Figure 11; the x-axis is the percentage of leaves in DRAM. We fix DPTree's merge threshold R at 0.1, so DPTree buffers at most 10% of the data. To keep recovery fast, we set the log capacity of 2-Tier-BM to 30% of the entire key-value set.

[Figure 11: DPTree vs. 2-Tier-BM — (a) Insert and (b) Lookup latency (us) over the buffer space ratio, for 50M 16B key-value pairs with uniformly distributed 8B keys.]

We see that DPTree consistently outperforms 2-Tier-BM on the insertion workload. On average, our implementation of 2-Tier-BM requires 8.5 flushes and 2 fences per insertion at a 100% buffer ratio, compared to the constant 2.2 flushes of DPTree. This is because, in physiological logging, every update requires logging the PM address and the before and after images of the value at that address. This results in more data being logged compared to DPTree and incurs 4 flushes per update, as shown in Table 1. In addition, logging more data results in more frequent checkpoints (because the log fills up), which swap out dirty pages; this results in around 4.5 flushes per op on average. As the buffer ratio decreases, latency degrades because of the increased number of expensive page swaps, which result in flushes. For a sorted leaf, the flush overhead is much more pronounced (up to 35x more flushes), because maintaining sorted order requires half of the key-value pairs to be moved and logged. For lookups, 2-Tier-BM outperforms DPTree by 50% only when it is entirely in DRAM. However, its latency degrades rapidly due to more page swaps as the buffer ratio decreases.

4.4 YCSB Workload

We then evaluate the concurrent indices using YCSB [10]. We vary the proportions of index operations and generate the following workload mixtures: 1) InsertOnly: 100% inserts of 50M keys in total. 2) ReadOnly: 100% lookups. 3) ReadUpdate: 50% reads and 50% updates. 4) ScanInsert: 95% scans and 5% inserts. 5) Mixed: 50% lookups, 20% updates, 10% scans and 20% inserts. The average scan selectivity is 0.001% with a standard deviation of 0.0006%.
[Figure 12: Throughput (MOps) of YCSB Workloads with Random Keys over the number of threads (up to 32) — (a) InsertOnly, (b) ReadOnly, (c) ReadUpdate, (d) ScanInsert, (e) Mixed; indices: DPTree, FASTFAIR, FPTree.]
For each workload, we first populate the index with 50M random 64-bit integer keys and then run the corresponding workload. The result reported for each workload is the average of 3 runs.

We use the default settings for FPTree and FASTFAIR described in the original papers [23, 12]. We notice that FASTFAIR [12] did not address concurrent updates; thus, we add our own implementation: perform a search first, then lock the leaf node and update the pointer in the leaf using a 64-bit failure-atomic write. For concurrent DPTree, we set the number of log partitions to 64, the merge threshold R to 0.07, the bloom filter error rate to 0.03, the base tree leaf node size to 8KB, and the log page size to 64KB. The results are shown in Figure 12.

InsertOnly. For this workload, DPTree consistently outperforms the others in terms of throughput. DPTree shows similar scalability to FASTFAIR, with up to 1.36x higher throughput at 32 threads. FPTree starts with similar scalability but degrades after 20 threads due to frequent transaction aborts. Since the Optane DIMM has a write bandwidth about 1/6 that of DRAM [13], all indices plateau at high thread counts. However, DPTree allows more concurrency to saturate the bandwidth due to its reduced PM writes per operation.

ReadOnly. For the ReadOnly results shown in Figure 12b, DPTree stands out with 1.32-1.58x and 1.37-1.7x better lookup throughput than FASTFAIR and FPTree as the thread count increases. With 32 threads, DPTree scales to 27.3x, whereas FASTFAIR and FPTree scale to 23.9x and 21.6x, respectively. From the CPU statistics collected using the Linux perf tool in Table 2, DPTree has the slightest increase in CPU metrics when the number of threads increases to 32, showing the scalability of our synchronization method. Another reason DPTree has the highest scalability is that the hash lookup inside a leaf requires less PM bandwidth than the leaf scan in FPTree and FASTFAIR.

ReadUpdate. For this workload, DPTree, FASTFAIR and FPTree scale to 21.9x, 18.1x, and 7.5x with 32 threads, respectively. Figure 12c shows that DPTree has the highest throughput for all thread configurations, outperforming the other indices by up to 1.7-5.07x. As shown in Table 2, DPTree has the smallest overhead across all metrics and the smallest increase in these metrics with 32 threads, thanks to its efficient lookup and update (one PM write). FPTree performs the worst in this case, as its update procedure requires at least three PM writes. The unsatisfactory scalability of FPTree is also reflected in the table, where its cycle and instruction counts rise significantly at 32 threads. This is also partly attributed to the increased abort rate of TSX at high thread counts.

ScanInsert and Mixed. For the ScanInsert workload, FASTFAIR maintains the highest scan throughput across all thread counts because of its sorted leaf design. DPTree shows similar throughput to FASTFAIR and is superior to FPTree by 2-3x when the thread count is less than 20, thanks to the order array. However, DPTree plateaus after 20 threads because the random accesses inside its large leaves require more PM bandwidth. For the Mixed workload, DPTree, FASTFAIR, and FPTree scale to 14x, 15.92x, and 8.2x, respectively. DPTree maintains the highest throughput at all thread counts, showing its ability to scale under various access patterns.

Table 2: CPU Stats of Workloads with Random 8B Keys (per op, 1 thread / 32 threads)
ReadUpdate
Index       Cycles        Insts       L3Miss       BrMiss
DPTree      1768/2947     456/474     3.05/3.25    2.34/2.50
FASTFAIR    2523/4436     696/703     9.07/9.16    7.77/8.29
FPTree      3581/15062    730/1624    5.25/7.93    16.22/24.13
ReadOnly
DPTree      1461/1662     444/447     2.68/2.82    1.76/1.80
FASTFAIR    1931/2552     634/651     7.55/7.69    6.95/7.01
FPTree      1876/2615     662/781     3.59/4.26    14.96/15.56

4.5 Recovery

In this section, we evaluate the recovery performance and scalability of DPTree against the other index structures. To measure the impact of data size on recovery, we vary the number of key-value pairs from 2^23 to 2^27. The settings for DPTree and FPTree are the same as in Section 4.4. We also measure the parallel recovery performance of DPTree and FPTree. For DPTree, parallel recovery is simple: the log partitions are read and replayed into the concurrent buffer tree in parallel. Since the FPTree paper did not describe a parallel recovery algorithm, we implement our own version for FPTree in three phases: Phase 1, traverse the leaf layer of FPTree to collect the leaves; Phase 2, spawn multiple threads to read the greatest keys out of these leaves in parallel; Phase 3, reconstruct the internal nodes with one thread.

[Figure 13: Recovery Performance — recovery time (s) over data size (2^23-2^27) for DPTree and FPTree with 1 and 32 threads.]
431
PM, the maximum key in each leaf. Whereas DPTree re- design suffers from inefficient range scan and high recovery
builds the buffer tree from the PLog, which is only a small cost due to the rebuilding of the entire B+ -Tree.
fraction of the complete key-value set. As for rebuilding Merge Trees. The well-known LSM-Tree[22] uses mul-
internal nodes of the base tree, it takes up negligible 0.2% tiple tree structures to form a logical index. DPTree adopts
in the total recovery time. This is because leaf nodes in the ideas of logging and leveling but differs from LSM-Tree
the base tree are large, and the maximum key is already in many ways. One one hand, DPTree employs only two
computed in leaf nodes, reducing PM reads significantly. levels to reduce read latency. The two-level design also sig-
For recovery with 32 threads, DPTree can achieve up to 8x nificantly reduces the worst case write amplification in LSM-
speedup over one thread, saturating the memory bandwidth Trees. On the other hand, DPTree employs in-place merging
of our system. FPTree, however, has only 2.6x speedup and optimized for PM instead of out-of-place merging optimized
up to 5.46x slower than DPTree. We find that with one for block-based storage. Another relevant work is the dual-
thread, in phase (1) and (3), FPTree spends 28% of the stage index architecture proposed by Zhang et al. [32]. DP-
time (each taking 26% and 2%). Phase (2) is able to scale Tree builds on top of this work but differs in several ways.
to 11x, which saturates the PM read bandwidth of the sys- First, their work targets indexing in DRAM without consid-
tem. However, the other two phases are hard to parallelize ering durability. Second, their work focuses on saving space
and turn out to be the bottlenecks. whereas DPTree focuses on optimizing index latency in such
5. RELATED WORK

DPTree is inspired by previous works on persistent indices, merge trees, and concurrent KV stores.

PM-Only Persistent Indices. For PM-only structures, data is completely in PM, enabling the possibility of near-instant recovery. Venkataraman et al. [26] proposed the CDDS B-Tree, which uses sorted leaves and fine-grained versioning for each entry to ensure consistency and enable concurrency. Such a design requires extensive persists and garbage collection. To reduce persists, Chen et al. [9] proposed the wB+-Tree, which keeps leaves unsorted and relies on 8B atomic writes for its bitmap and logging to achieve consistency. It also employs a slot-array to accelerate searches. Similar to wB+-Tree, NV-Tree [31], a persistent B+-Tree proposed by Yang et al., employs unsorted leaves with a bitmap and a reduced persistence guarantee for internal nodes. Internal nodes are rebuilt from the persistent leaves on recovery. This design requires sorting for range queries. FASTFAIR [12], a persistent and concurrent B+-Tree by Hwang et al., proposes the FAST (Failure-Atomic ShifT) and FAIR techniques to keep nodes sorted by exploiting properties of modern CPUs. The BzTree [6] by Arulraj et al. employs PMwCAS (Persistent Multi-word CAS) [28] to achieve persistence and concurrency in a B+-Tree. However, BzTree requires at least 9 persists per insert (8 from PMwCAS). The failure-atomic 8B update technique is also used to develop a persistent radix tree called WORT [15], proposed by Lee et al. However, the radix tree suffers from poor range scan performance. This is even more pronounced in WORT since key-value payloads are not stored in-line. DPTree differs from these works in that it amortizes the persistence overhead of metadata, such as the bitmap and slot-array, over a batch of updates.

DRAM-PM Persistent Indices. For this kind of data structure, DRAM is used for auxiliary data that is rebuilt on recovery. Since DRAM has lower latency than PM, this scheme usually results in improved performance at the cost of longer recovery time. FPTree [23], a persistent and concurrent B+-Tree proposed by Oukid et al., employs such a design. The internal nodes are placed in DRAM and the leaves in PM. Bitmap, logging, and fingerprints are used for crash consistency and for reducing PM reads. Similar to NV-Tree, FPTree has to perform sorting for range queries. DPTree, on the other hand, uses an order-array to avoid sorting. Another related work is HiKV [30], proposed by Xia et al. HiKV combines a persistent hashtable and a volatile B+-Tree. Such a design suffers from inefficient range scans and high recovery cost due to the rebuilding of the entire B+-Tree.

Merge Trees. The well-known LSM-Tree [22] uses multiple tree structures to form a logical index. DPTree adopts the ideas of logging and leveling but differs from LSM-Tree in many ways. On one hand, DPTree employs only two levels to reduce read latency. The two-level design also significantly reduces the worst-case write amplification of LSM-Trees. On the other hand, DPTree employs in-place merging optimized for PM instead of out-of-place merging optimized for block-based storage. Another relevant work is the dual-stage index architecture proposed by Zhang et al. [32]. DPTree builds on top of this work but differs in several ways. First, their work targets indexing in DRAM without considering durability. Second, their work focuses on saving space, whereas DPTree focuses on optimizing index latency in DRAM-PM hybrid systems. Lastly, their work lacks designs for concurrency, which our work complements.

Concurrent Log-Structured Stores. To unleash the power of multi-core processors, researchers study concurrent and scalable log-structured KV stores. cLSM [11] increases the concurrency of its memtable with a reader/writer lock and a concurrent skip list. However, its scalability is limited due to the use of shared counters. Nibble [21] improves the scalability of an in-memory KV store using techniques including partitioning, optimistic concurrent lookups, and multi-head logs. However, it comes at the expense of range scan ability and durability. Different from these works, DPTree addresses the scalability problem by carefully avoiding contention on shared states and introducing concurrent partitioned PM logs, without sacrificing range scans and durability.
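To illustrate the partitioned-log idea in general terms (a generic sketch under our own assumptions, not DPTree's actual implementation), each worker appends to its own log partition and advances only a partition-local tail, so the append path never touches a shared counter; persist() below is a placeholder for a CLWB/SFENCE sequence or an equivalent PMDK call.

#include <cstddef>
#include <cstdint>
#include <vector>

struct LogEntry { uint64_t key; uint64_t value; };

// Placeholder for cache-line write-back + fence on a PM-resident address.
inline void persist(const void* /*addr*/, size_t /*len*/) { /* CLWB + SFENCE */ }

struct LogPartition {
    std::vector<LogEntry> buf;   // stands in for a PM-resident log region
    size_t tail = 0;             // advanced only by the owning thread

    explicit LogPartition(size_t capacity) : buf(capacity) {}

    // Append without touching any shared state; capacity handling omitted.
    void append(const LogEntry& e) {
        buf[tail] = e;
        persist(&buf[tail], sizeof(e));   // persist the entry first
        ++tail;
        persist(&tail, sizeof(tail));     // then persist the new tail
    }
};

// One partition per worker thread: thread i only appends to parts[i],
// so concurrent writers never contend on a shared log head or counter.
struct PartitionedLog {
    std::vector<LogPartition> parts;
    PartitionedLog(unsigned nthreads, size_t capacity) {
        parts.reserve(nthreads);
        for (unsigned i = 0; i < nthreads; ++i) parts.emplace_back(capacity);
    }
    void append(unsigned thread_id, const LogEntry& e) {
        parts[thread_id].append(e);
    }
};

Because every partition is independent, appends scale with the number of threads, and recovery in this sketch simply replays all partitions.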
6. CONCLUSION

In this paper, we proposed DPTree, a novel index structure designed for DRAM-PM hybrid systems. DPTree employed novel techniques such as flush-optimized persistent logging, coarse-grained versioning, a hash-based node design, and a crash-consistent in-place merge algorithm tailored for PM. As a result, DPTree reduced flush overhead considerably for cache-line-sized key-value payloads. The combination of the Optimistic Lock Coupling B+-Tree, coarse versioning, and in-place merging enabled DPTree to be concurrent and scalable in multi-core environments.

We conducted a comprehensive experimental evaluation of DPTree against state-of-the-art PM indices. We found that the single-threaded DPTree has the lowest flush overhead and the highest search throughput among those evaluated. For concurrent workloads, DPTree exhibits good scalability for basic operations. In terms of recovery, DPTree achieves both better and more scalable performance than FPTree for single-threaded and parallel recovery.

As future work, we consider querying the log directly without a DRAM buffer, e.g., structuring the PM log as a persistent hash table for both buffering and durability, to reduce the complexity of the design.

Acknowledgement

We would like to thank our anonymous shepherd and reviewers for their guidance and insightful comments on this work. We would also like to thank Dr. Bin Cui, Jinming Hu, and Dr. Xinyuan Luo for their useful suggestions. We would like to give special thanks to Alibaba Group for providing the
PM hardware and for their help with this work. This research is supported by the National Basic Research Program of China (Grant No. 2015CB352402), the National Natural Science Foundation of China (Grant No. 61672455), and the Natural Science Foundation of Zhejiang Province (Grant No. LY18F020005).
7. REFERENCES

[1] 3D XPoint: A breakthrough in non-volatile memory technology. [Link]en/architecture-and-technology/intel-micron-[Link].
[2] Aerospike performance on Intel Optane persistent memory. http://[Link]/rs/229-XUE-318/images/Aerospike Solution Brief [Link].
[3] Intel Optane DC persistent memory: A major advance in memory and storage architecture. [Link]30/intel-optane-dc-persistent-memory-a-major-advance-in-memory-and-storage-architecture.
[4] Intel Optane DC persistent memory Intel virtual event. [Link]architecture-and-technology/optane-dc-[Link].
[5] D. Apalkov, A. Khvalkovskiy, S. Watts, V. Nikitin, X. Tang, D. Lottis, K. Moon, X. Luo, E. Chen, A. Ong, A. Driskill-Smith, and M. Krounbi. Spin-transfer torque magnetic random access memory (STT-MRAM). J. Emerg. Technol. Comput. Syst., 9(2):13:1-13:35, May 2013.
[6] J. Arulraj, J. Levandoski, U. F. Minhas, and P.-A. Larson. BzTree: A high-performance latch-free range index for non-volatile memory. Proceedings of the VLDB Endowment, 11(5):553-565, 2018.
[7] J. Arulraj, A. Pavlo, and K. T. Malladi. Multi-tier buffer management and storage system design for non-volatile memory, 2019.
[8] S. K. Cha, S. Hwang, K. Kim, and K. Kwon. Cache-conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems. In VLDB, 2001.
[9] S. Chen and Q. Jin. Persistent B+-trees in non-volatile main memory. Proc. VLDB Endow., 8(7):786-797, Feb. 2015.
[10] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143-154, New York, NY, USA, 2010. ACM.
[11] G. Golan-Gueta, E. Bortnikov, E. Hillel, and I. Keidar. Scaling concurrent log-structured data stores. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 32:1-32:14, New York, NY, USA, 2015. ACM.
[12] D. Hwang, W.-H. Kim, Y. Won, and B. Nam. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proceedings of the 16th USENIX Conference on File and Storage Technologies, FAST '18, pages 187-200, Berkeley, CA, USA, 2018. USENIX Association.
[13] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson. Basic performance measurements of the Intel Optane DC persistent memory module, 2019.
[14] H. Kimura. FOEDUS: OLTP engine for a thousand cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 691-706, New York, NY, USA, 2015. ACM.
[15] S. K. Lee, K. H. Lim, H. Song, B. Nam, and S. H. Noh. WORT: Write optimal radix tree for persistent memory storage systems. In Proceedings of the 15th USENIX Conference on File and Storage Technologies, FAST '17, pages 257-270, Berkeley, CA, USA, 2017. USENIX Association.
[16] V. Leis, F. Scheibner, A. Kemper, and T. Neumann. The ART of practical synchronization. In Proceedings of the 12th International Workshop on Data Management on New Hardware, DaMoN '16, pages 3:1-3:8, New York, NY, USA, 2016. ACM.
[17] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-Tree: A B-tree for new hardware platforms. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), ICDE '13, pages 302-313, Washington, DC, USA, 2013. IEEE Computer Society.
[18] J. Liu and S. Chen. Initial experience with 3D XPoint main memory. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pages 300-305. IEEE, 2019.
[19] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 183-196, New York, NY, USA, 2012. ACM.
[20] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509-518, 1998.
[21] A. Merritt, A. Gavrilovska, Y. Chen, and D. Milojicic. Concurrent log-structured memory for many-core key-value stores. Proc. VLDB Endow., 11(4):458-471, Dec. 2017.
[22] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351-385, June 1996.
[23] I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and W. Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 371-386, New York, NY, USA, 2016. ACM.
[24] D. Schwalb, T. Berning, M. Faust, M. Dreseler, and H. Plattner. nvm malloc: Memory allocation for NVRAM. ADMS@VLDB, 15:61-72, 2015.
[25] A. van Renen, V. Leis, A. Kemper, T. Neumann, T. Hashida, K. Oe, Y. Doi, L. Harada, and M. Sato. Managing non-volatile memory in database systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 1541-1555, New York, NY, USA, 2018. ACM.
[26] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Campbell. Consistent and durable data structures for non-volatile byte-addressable memory. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST '11, pages 5-5, Berkeley, CA, USA, 2011. USENIX Association.
[27] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight persistent memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 91-104, New York, NY, USA, 2011. ACM.
[28] T. Wang, J. Levandoski, and P.-A. Larson. Easy lock-free indexing in non-volatile memory. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 461-472. IEEE, 2018.
[29] H. S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson. Phase change memory. Proceedings of the IEEE, 98(12):2201-2227, 2010.
[30] F. Xia, D. Jiang, J. Xiong, and N. Sun. HiKV: A hybrid index key-value store for DRAM-NVM memory systems. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '17, pages 349-362, Berkeley, CA, USA, 2017. USENIX Association.
[31] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and B. He. NV-Tree: Reducing consistency cost for NVM-based single level systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST '15, pages 167-181, Berkeley, CA, USA, 2015. USENIX Association.
[32] H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, and R. Shen. Reducing the storage overhead of main-memory OLTP databases with hybrid indexes. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 1567-1581, New York, NY, USA, 2016. ACM.