
Modernization of Databases in the Cloud Era: Building Databases that Run like Legos


Feifei Li
Alibaba Group
[email protected]
ABSTRACT

Utilizing the cloud for common and critical computing infrastructure has already become the norm across the board. The rapid evolution of the underlying cloud infrastructure and the revolutionary development of AI present both challenges and opportunities for building new database architectures and systems. It is crucial to modernize database systems in the cloud era, so that next-generation cloud-native databases may run like Legos: adaptive, flexible, reliable, and smart in the face of dynamic workloads and varying requirements.

That said, we observe four critical trends and requirements for the modernization of cloud databases: embracing cloud-native architecture, full integration with the cloud platform and orchestration, co-design for data fabric, and moving towards being AI augmented. Modernizing database systems by adopting these critical trends and addressing the key challenges associated with them provides ample opportunities for the data management communities in both academia and industry to explore. We will provide an in-depth case study of how we modernize PolarDB with respect to embracing these four trends in the cloud era. Our ultimate goal is to build databases that run just like playing with Legos, so that a database system fits rich and dynamic workloads and requirements in a self-adaptive, performant, easy-/intuitive-to-use, reliable, and intelligent manner.

PVLDB Reference Format:
Feifei Li. Modernization of Databases in the Cloud Era: Building Databases that Run like Legos. PVLDB, 16(12): 4140-4151, 2023.
doi:10.14778/3611540.3611639

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 16, No. 12 ISSN 2150-8097.
doi:10.14778/3611540.3611639

1 INTRODUCTION

The cloud has become the de-facto IT infrastructure standard for large and small enterprises across almost all industry sectors. With the increasing deployment of mission-critical systems for businesses, ranging from transaction processing to analytical processing on the cloud, fundamental changes are required for moving database systems to the cloud. In particular, we need to transform cloud-hosted databases into cloud-native databases, so that database systems take the best advantage of the underlying cloud infrastructure. Meanwhile, with the rapid deployment of AI and big data technologies, database systems need to be more intelligent, so that they are easy and intuitive to use. In summary, database systems need to be modernized in the cloud era, so that they are born to be both cloud native and AI native.

That said, we identify four critical trends and requirements for the modernization of cloud databases: embracing cloud-native architecture, full integration with the cloud platform and orchestration, co-design for data fabric, and moving towards being AI augmented. Modernizing database systems by adopting these critical trends and addressing the key challenges associated with them provides ample opportunities for the data management communities in both academia and industry to explore. We will provide an in-depth case study of how we modernize PolarDB with respect to embracing these four trends in the cloud era.

In particular, major cloud service providers have deployed the latest networking infrastructure, memory/storage technologies, and interconnected heterogeneous computing devices in their IDCs, such as RDMA, persistent memory, CXL memory, cloud storage, FPGAs, GPUs, and DPUs such as BlueField. Meanwhile, the orchestration of cloud resources has evolved quickly from hypervisor-based virtual machines to lightweight containers managed by Kubernetes (and mixtures of the two). The first generation of cloud-native databases explored the decoupling of computation and storage, so that computation nodes and storage nodes in a database system can scale independently. That is no longer sufficient to fully exploit the potential of what the underlying cloud infrastructure has to offer. The new generation of cloud-native database systems needs to disaggregate storage, memory, and CPU cores, so that advanced features such as serverless and multi-master can be developed more naturally. At the same time, orchestration of container-based cloud-native database instances (often, the computation nodes) becomes a critical need to achieve high availability, high elasticity, and high resource utilization efficiency.

Furthermore, as more and more data are moved to the cloud and generated in the cloud, business applications on the cloud increasingly demand agile development and deployment frameworks for their data processing needs, ranging from transaction processing to analytical processing. The concept of data fabric has become both popular and important: it enables the easy sharing and smooth flow of data between different data processing engines and eco-systems. This requires the co-design of different cloud-native database systems to achieve a data fabric, for example zero-ETL between OLTP and OLAP database systems, cloud-native HTAP, and the sharing of meta-data among different cloud-native database systems.

Lastly, with the eruptive development of ML and AI technologies, LLMs (large language models) being the latest example, integrating AI technologies natively inside a database system will become a standard practice.

Figure 1: Building databases that run like legos. (The figure assembles RW, RO, HTAP, and AI nodes from Elastic Compute Units (ECU), Elastic Shared Memory Units (EMU), and Elastic Shared Storage Units (ESU) over PolarStore, OSS, and PolarStore@Smart-SSD, connected by a high-speed RDMA interconnect, with scale-up/scale-out reconfiguration (e.g., 2C <=> 8C), a cross-AZ multi-zone edition, and cross-region GDN clusters.)


In addition to having ML-based tuning, monitoring, and optimization for database systems (i.e., AIOps for database systems), supporting AI inference inside a database system will also become widely available. Furthermore, vector databases or vector engines will become a commodity in most database systems, to store and process the high-dimensional vectors produced by the embedding processes of various LLMs. ML and AI technologies will also change how we interact with a database system: NL2SQL (and NL2API) interfaces will witness wide adoption in BI and data science applications.

In this keynote, we highlight the key challenges and development of modern cloud-native databases using PolarDB as an example. Our ultimate goal is to build databases that run just like playing with Legos, as shown in Figure 1, so that a database system fits rich and dynamic workloads and requirements in a self-adaptive, performant, easy-/intuitive-to-use, reliable, and intelligent manner.

2 MODERNIZATION OF CLOUD DATABASES

2.1 Four Trends
As mentioned above, we have observed four critical trends and requirements for the modernization of cloud databases: embracing cloud-native architecture, full integration with the cloud platform and orchestration, co-design for data fabric, and moving towards being AI augmented. We will first give an overview of PolarDB and then provide an in-depth study of PolarDB to demonstrate the development of modern cloud databases with respect to these four critical trends and requirements.

2.2 A Case Study: PolarDB
PolarDB [26] is a cloud-native database system developed at Alibaba Cloud that adopts a shared-storage architecture. It is derived from the MySQL code base and uses PolarFS [11] as its underlying storage pool. It includes one primary read-write node (i.e., the RW node) and multiple read-only replicas (i.e., RO nodes) in the compute node layer. Like a traditional database kernel, each RW or RO node contains a SQL processor, a transaction engine (such as InnoDB [32] or X-Engine [23]), and a buffer pool to serve queries and transactions. In addition, there are some stateless proxy nodes for the purpose of load balancing.

Figure 2: PolarDB Architecture.

PolarFS is a durable, atomic, and scale-out distributed storage service. It provides virtual volumes that are partitioned into chunks of 10GB each, distributed across multiple storage nodes. A volume contains up to 10,000 chunks and can provide a maximum capacity of 100TB. Chunks are provisioned on demand, so that volume space can grow dynamically. Each chunk has three replicas, and linear serializability is guaranteed through ParallelRaft, a consensus protocol derived from Raft.

The RW node and RO nodes synchronize memory state through redo logs, and coordinate consistency through the log sequence number (LSN), which indicates an offset into the redo log files (e.g., in InnoDB). As shown in Figure 2, in a transaction ①, after RW finishes flushing all redo log records to PolarFS ②, the transaction can be committed ③. RW then broadcasts a message that the redo log has been updated, together with its latest LSN lsn_RW, to all RO nodes asynchronously ④. After a node RO_i receives the message from RW, it pulls the redo log updates from PolarFS ⑤ and applies them to the buffered pages in its buffer pool ⑥, so that RO_i stays synchronized with RW. RO_i then piggybacks the redo log offset it has consumed, lsn_{RO_i}, in its reply to RW ⑦. RW can purge the redo log before the min{lsn_{RO_i}} location, and flush the dirty pages older than min{lsn_{RO_i}} to PolarFS in the background ⑧. Meanwhile, RO_i can serve read transactions using snapshot isolation with versions earlier than lsn_{RO_i} ⑨. Some RO nodes may fall behind because of high CPU utilization or network congestion: if a node RO_k has an LSN lsn_{RO_k} far lower than RW's lsn_RW (e.g., a lag larger than one million), it will be detected and evicted from the cluster to avoid slowing down RW through dirty page flushes.

The stateless proxy nodes provide a transparent load balancing service that separates read and write traffic, distributing read requests to RO nodes and forwarding write requests to RW.
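As a concrete illustration of steps ①-⑨, the following minimal sketch (our own illustration with hypothetical names, not PolarDB code) models the redo-log flow between one RW node, the shared storage, and an RO node, with in-memory stand-ins for PolarFS and the buffer pools.

```python
# A minimal sketch of the RW/RO redo-log synchronization described above.
# All classes and names are hypothetical stand-ins, not PolarDB APIs.

class SharedStorage:                      # stands in for PolarFS
    def __init__(self):
        self.redo = []                    # redo records, indexed by LSN

    def append(self, record):             # step 2: RW flushes a redo record
        self.redo.append(record)
        return len(self.redo)             # latest LSN

class ReadOnlyNode:
    def __init__(self, storage):
        self.storage = storage
        self.applied_lsn = 0
        self.buffer_pool = {}             # page_id -> page contents

    def on_rw_broadcast(self, lsn_rw):    # step 4: RW announces lsn_RW
        # steps 5-6: pull redo records from storage, apply to buffered pages
        for page_id, payload in self.storage.redo[self.applied_lsn:lsn_rw]:
            self.buffer_pool[page_id] = payload
        self.applied_lsn = lsn_rw
        return self.applied_lsn           # step 7: piggyback lsn_RO_i back

class ReadWriteNode:
    def __init__(self, storage, ro_nodes):
        self.storage, self.ro_nodes = storage, ro_nodes
        self.lsn = 0

    def commit(self, writes):             # steps 1-3: flush redo, then commit
        for w in writes:
            self.lsn = self.storage.append(w)
        # step 4 (asynchronous in the real system; synchronous here for brevity)
        acks = [ro.on_rw_broadcast(self.lsn) for ro in self.ro_nodes]
        return min(acks)                  # step 8: purge redo before min{lsn_RO_i}

storage = SharedStorage()
ro = ReadOnlyNode(storage)
rw = ReadWriteNode(storage, [ro])
print(rw.commit([("page7", "v1"), ("page9", "v2")]))  # -> 2; RO is in sync
```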

Figure 3: Architectures for resource disaggregation. ((a) shared storage architecture; (b) disaggregation architecture.)

3 CLOUD-NATIVE ARCHITECTURE

This section introduces PolarDB's capabilities that embrace the advantages of cloud-native architecture. They have significantly strengthened the system's flexibility (e.g., via resource disaggregation), efficiency (e.g., via multi-master replication), and consistency (e.g., via strongly consistent reads).

3.1 Resource Disaggregation
3.1.1 Background. There are three typical architectures for cloud databases: 1) monolithic machine; 2) virtual machine with remote disk; and 3) shared storage, as shown in Figure 3(a). The last two can be referred to as decoupling computation from storage. Though these architectures have been widely used, they all suffer from challenges caused by resource coupling.

Under the monolithic machine architecture, all resources (e.g., CPU, memory, and storage) are tightly coupled. It is difficult to sustain a high utilization rate for the different resources allocated on a physical machine, and hence they are prone to fragmentation. In addition, a system with tightly coupled resources has the problem of fate sharing, i.e., the failure of one resource will cause the failure of the others, leading to longer system recovery times. With the decoupled computation-and-storage architecture, DBaaS (DB as a service) can independently improve the resource utilization of the storage pool. The shared-storage subtype further reduces storage costs: the primary and the read replicas can attach to and share the same storage. Read replicas help to serve high-volume read traffic and offload analytical queries from the primary. However, in all these architectures, problems like bin-packing of CPU and memory and the lack of flexible and scalable memory resources remain unsolved. Furthermore, each read replica keeps a redundant in-memory data copy, leading to high memory costs.

3.1.2 PolarDB Serverless. To address the above issues, we propose a novel cloud database paradigm called the disaggregation architecture [13], as shown in Figure 3(b). It goes one step further than the shared storage architecture. The disaggregation architecture runs in disaggregated data centers (DDC), in which CPU, memory, and storage resources are no longer tightly coupled as in a monolithic machine. Resources are located in different nodes connected through a high-speed network. As a result, each resource type can improve its utilization rate and expand its volume independently. This also eliminates fate sharing, i.e., it allows each resource to be recovered from failure and upgraded independently. Moreover, data pages in the remote memory pool can be shared among multiple database processes, analogous to the storage pool being shared in the shared storage architecture. Adding a read replica no longer increases the cost of memory resources, beyond consuming a small piece of local memory.

Following this disaggregation architecture, we built PolarDB Serverless [13]. It introduces a multi-tenant scale-out memory pool, including page allocation and life-cycle management. Our first challenge is to ensure that the system executes transactions correctly after adding remote memory to the system. For example, a read after a write should not miss any updates, even across nodes; we solve this using cache invalidation. When RW is splitting or merging a B+Tree index, other RO nodes should not see an inconsistent B+Tree structure in the middle; we protect against this with global page latches. When an RO node performs read-only transactions, it must avoid reading anything written by uncommitted transactions; we achieve this through the synchronization of read views between database nodes.

Besides, the move to the disaggregation architecture could have a negative impact on database performance, because data is likely to be accessed remotely, which introduces significant network latency. Our second challenge is therefore to execute transactions efficiently. We exploit RDMA optimizations extensively, especially one-sided RDMA verbs, including using RDMA CAS [42] to optimize the acquisition of global latches. To improve concurrency, both RW and RO nodes use optimistic locking techniques to avoid unnecessary global latches. On the storage side, page materialization offloading allows dirty pages to be evicted from remote memory without flushing them to storage, while index-aware prefetching improves query performance.

Finally, the disaggregation architecture complicates the system and hence increases the variety and probability of system failures. As a cloud database service, our third challenge is to build a reliable system. We derive strategies for handling single-node crashes of the different node types, which guarantee that there is no single point of failure in the system. Because the states in memory and storage are decoupled from the database node, crash recovery of the RW node becomes 5.3 times faster [13] than in the monolithic machine architecture.
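The following sketch illustrates the kind of latch acquisition that RDMA CAS enables: a single compare-and-swap on a latch word that lives in the remote memory pool, with no remote CPU involvement. We emulate the one-sided CAS with a plain Python function; all names are illustrative, not PolarDB Serverless APIs.

```python
# A sketch of acquiring a global page latch via compare-and-swap, in the
# spirit of the RDMA CAS optimization above. A real one-sided RDMA CAS
# executes on the memory node's NIC; a list cell emulates the remote
# 64-bit latch word here.

UNLOCKED = 0

def rdma_cas(remote_word, expected, new):
    """Emulated one-sided CAS on remote memory (atomic on a real NIC)."""
    if remote_word[0] == expected:
        remote_word[0] = new
        return True
    return False

def acquire_exclusive(remote_word, node_id, max_retries=1000):
    for _ in range(max_retries):
        if rdma_cas(remote_word, UNLOCKED, node_id):
            return True          # latch acquired without a remote CPU cycle
    return False                 # caller backs off and retries later

def release(remote_word, node_id):
    assert remote_word[0] == node_id
    remote_word[0] = UNLOCKED

latch = [UNLOCKED]               # the latch word in the remote memory pool
if acquire_exclusive(latch, node_id=42):
    # ... split/merge the B+Tree page safely ...
    release(latch, node_id=42)
```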

3.2 Multi-Master Replication
3.2.1 Background. Recall that PolarDB consists of one primary (RW) node to process read/write requests and one or more secondary (RO) nodes to handle read-only requests. Under write-heavy workloads, however, the single primary node becomes the bottleneck. We can scale up the primary node when the write pressure grows, but it is still restricted by the physical machine's specification. When the primary node fails, one of the secondary nodes is promoted to be the new primary node; however, this kind of high availability still incurs a brief downtime during failover [1]. Thus, multi-master cloud-native databases are highly desirable for high scalability and availability.

3.2.2 PolarDB-MM. We design and implement PolarDB-MM (PolarDB Multi-Master) on top of PolarDB's shared storage architecture. It still uses PolarFS [11] as the shared storage, but multiple master nodes are connected to the shared storage, and all master nodes have equal access to it. Moreover, we use a modern RDMA network to speed up communication among the different masters.

Figure 4: Architecture of PolarDB-MM.

Figure 4 shows the overall architecture. All master nodes share the same data, have equal access to it, and can process read/write requests simultaneously. The core component is the Polar Multi-Master Fusion Service (PF), which serves the master nodes to achieve concurrent transaction execution. The master nodes and PF are connected via the high-speed RDMA network, and our highly optimized RDMA library supports ultra-low-latency communication between the masters and PF. PF only serves the master nodes and has no connection to the storage; all I/Os are issued on the master nodes.

PF has three components: buffer fusion, transaction fusion, and lock fusion. Buffer fusion is designed to achieve buffer coherency between different master nodes. In PolarDB-MM, each master node has its own buffer pool, and PF maintains a global buffer pool in its buffer fusion module. Since all master nodes have equal access to the shared storage, each data page can be read/written by any master. We therefore design a distributed page locking scheme to control inter-master concurrent access to page data: a master has to acquire a shared/exclusive lock to read/write a page from the global buffer pool, and once a master updates a page, it invalidates that page in the other masters' local buffer pools before releasing the page lock. Buffer fusion thus achieves a consistent page state across the multi-master cluster.

Based on the consistent page state, transaction fusion further supports concurrent transaction execution on multiple masters while guaranteeing the transactions' ACID properties. Transaction fusion provides a centralized Timestamp Oracle (TSO) that monitors the commits of all transactions. Each transaction has to request a global commit timestamp (CTS) from the TSO before committing; the CTS reflects the transactions' ordering. PolarDB-MM also supports MVCC to achieve high-throughput, lock-free snapshot reads, and transaction fusion provides a novel transaction system to speed up transaction visibility decisions.

Lock fusion supports row-level locks for concurrent transaction execution on different master nodes. To support efficient locking across nodes, we store the lock information together with the row data, which avoids much of the overhead of maintaining lock information in a global data structure.

Our evaluation shows that PolarDB-MM scales linearly when different masters access different parts of the dataset, and each node's throughput drops by only 15%-30% when the masters uniformly access all the data in SysBench's read-write workload.
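To make the TSO/CTS mechanism concrete, here is a minimal sketch of commit-timestamp-based snapshot visibility under our own simplified model (illustrative names, not the PolarDB-MM implementation): a version is visible to a snapshot exactly when its writer's CTS precedes the snapshot's read timestamp.

```python
# A sketch of TSO-issued commit timestamps and MVCC snapshot visibility.

import itertools

class TimestampOracle:
    """Centralized TSO: hands out monotonically increasing timestamps."""
    def __init__(self):
        self._counter = itertools.count(1)
    def next_ts(self):
        return next(self._counter)

tso = TimestampOracle()
committed = {}                      # txn_id -> commit timestamp (CTS)

def commit(txn_id):
    committed[txn_id] = tso.next_ts()   # each master requests its CTS from the TSO

def begin_snapshot():
    return tso.next_ts()                # read timestamp for a snapshot read

def visible(writer_txn, read_ts):
    """A version is visible iff its writer committed before the snapshot began."""
    cts = committed.get(writer_txn)
    return cts is not None and cts < read_ts

commit("t1")                        # CTS = 1
snap = begin_snapshot()             # read_ts = 2
commit("t2")                        # CTS = 3, after the snapshot began
print(visible("t1", snap), visible("t2", snap))   # True False
```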
3.3 Strongly Consistent Reads
3.3.1 Background. In PolarDB, to keep an RO node's buffered data up-to-date, the RW node generates a corresponding log record for each update and ships the log to the RO nodes, which apply it to their buffered data. Since the log application process is asynchronous, an RO node may be unable to return the latest updates that have already taken place on the RW node and could consequently return "stale" data. Many cloud-native databases claim that RO nodes can improve read performance; however, for the reason outlined above, an RO node can only serve applications that do not require read-after-write consistency.

However, the strongly consistent read (i.e., a read request always sees the latest committed updates that happen before it, a.k.a. the strict consistency model [48]) is an essential need in many applications [2]. For instance, in Alibaba's e-commerce applications, if strong consistency cannot be guaranteed, a customer who has placed an order may soon find that the order does not exist or is shown as unpaid after payment. The need for strongly consistent reads also appears in scenarios where databases are used to support interactions among microservices [25]. These microservices usually share the same databases and have dependencies at the application level; they rely on the database to provide strong consistency to ensure the interactions proceed as expected. At Alibaba Cloud, we also receive many requests for such strongly consistent reads from users such as insurance companies and financial institutions.

3.3.2 PolarDB-SCC. To obtain strong consistency today, applications have to send all read requests to the RW node. Consequently, we cannot improve read throughput by adding more RO nodes, and the RW node can quickly become the bottleneck. This dramatically limits the system's ability to process read-dominant workloads [8, 15, 43]. To support the RO nodes' auto-scaling-out, a unified endpoint is required for users, and strongly consistent reads must be guaranteed by this endpoint so that writes on it are immediately visible to the reads that follow. Therefore, a new system design is imperative to ensure strongly consistent reads on RO nodes in a cloud-native database cluster, improving system performance and making the system truly scalable.

To enable low-latency, strongly consistent reads on RO nodes, we built PolarDB-SCC [49] (PolarDB Strongly Consistent Cluster), which is designed with a read-wait policy. It aims to provide a low-latency, strongly consistent cloud-native database cluster in which the RO nodes always return the latest updates committed ahead of the request's/transaction's start timestamp. This enables the system to distribute read requests to the RO nodes and to split read/write requests while ensuring strong consistency. As a result, the cluster can provide a unified, strongly consistent endpoint for applications (e.g., via a proxy) and adjust the number of RO nodes on demand, elastically. Resource utilization on the RO nodes is also improved, rather than deploying RO nodes solely for handling failover. The main challenge is to keep the in-memory data consistent between the RW and RO nodes while ensuring low latency. The key idea of PolarDB-SCC is to eliminate unnecessary waits and reduce the necessary wait time on the RO node.

Figure 5: Architecture of PolarDB-SCC.

Figure 5 shows the overall architecture. The core components are the hierarchical modification tracker, the Linear Lamport timestamp, and the RDMA-based log shipment protocol. The hierarchical modification tracker maintains the RW node's modifications at three levels: the global level maintains the whole database's latest modification timestamp, while the table and page levels record individual tables'/pages' newest modification timestamps. To perform a strongly consistent read, the RO node first checks the RW node's global-level timestamp, then the table- and page-level timestamps. Once one level is satisfied, it directly processes the request without checking the next level; it only needs to wait for log application on the requested pages when the last (page) level is unsatisfied.

Since the latest modification timestamps are maintained on the RW node, the RO node has to fetch them from the RW node for each request. Although the RDMA network is fast, the overhead is still significant when the RO node is heavily loaded. To overcome the timestamp-fetching overhead, we propose the Linear Lamport timestamp. Based on it, the RO node can store a timestamp locally after fetching it from the RW node: any request arriving at the RO node earlier than TS_ro can directly use the locally stored timestamp instead of fetching a new one from the RW node. This saves many fetching requests when the load on the RO nodes is heavy.

Finally, to further minimize the network overhead, we adopt one-sided RDMA for both log shipment and timestamp fetching. We propose a one-sided RDMA-based log shipment protocol to write the RW node's log to the RO nodes; one-sided RDMA also saves many CPU cycles during remote writes.
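The sketch below captures our reading of this hierarchical check combined with the Linear Lamport reuse rule (a request that arrived before the last timestamp fetch may reuse the cached value). It is an illustration with hypothetical names, not PolarDB-SCC code.

```python
# A sketch of PolarDB-SCC's read-wait decision: consult global, then table,
# then page timestamps, and wait for log apply only when the page-level
# check fails. The cache mimics the Linear Lamport timestamp reuse.

import time

class RWTracker:                          # lives on the RW node
    def __init__(self):
        self.global_ts = 0
        self.table_ts = {}                # table -> last modification ts
        self.page_ts = {}                 # (table, page) -> last modification ts

class RONode:
    def __init__(self, tracker):
        self.tracker = tracker
        self.applied_ts = 0               # how far the log applier has caught up
        self.cached_ts = 0                # locally cached RW timestamp
        self.cached_at = 0.0              # local arrival time of that fetch

    def fetch_rw_ts(self):                # one RDMA fetch, then reusable locally
        self.cached_ts = self.tracker.global_ts
        self.cached_at = time.monotonic()
        return self.cached_ts

    def strongly_consistent_read(self, table, page, arrived_at):
        # Linear Lamport reuse: a request that arrived before the last fetch
        # may use the cached timestamp instead of fetching again.
        ts = self.cached_ts if arrived_at <= self.cached_at else self.fetch_rw_ts()
        if self.applied_ts >= ts:
            return "serve"                # global level already satisfied
        if self.applied_ts >= self.tracker.table_ts.get(table, 0):
            return "serve"                # table level satisfied
        if self.applied_ts >= self.tracker.page_ts.get((table, page), 0):
            return "serve"                # page level satisfied
        return "wait-for-log-apply"       # only now block on this page

ro = RONode(RWTracker())
print(ro.strongly_consistent_read("orders", 12, arrived_at=time.monotonic()))
```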
4 CLOUD PLATFORM INTEGRATION AND ORCHESTRATION

While cloud-native databases are becoming an inexorable trend in the database industry, their unique advantages are rooted in close integration and co-design with the underlying cloud infrastructure. In this section, we introduce two representative examples that reflect this concept: our best practice in resource scheduling over a unified resource pool (Section 4.1), and the adoption of computational storage devices that enable significant performance improvements (Section 4.2).

4.1 Unified Resource Pool
With the rising popularity of container infrastructures (such as Kubernetes), there has been a trend to host instances of cloud-native databases within containers, benefiting from their strong support for orchestration and migration. This helps cloud vendors achieve high availability, high elasticity, and high resource utilization efficiency. A cluster management system is employed to manage the jobs or tasks (i.e., database instances) running on the cluster of machines (nodes). At the center of a cluster management system is a resource scheduler, which dictates when and how cloud resources are allocated to different jobs.

4.1.1 Balance Allocation Rate and Availability. Typically, cloud vendors use two critical metrics to assess whether a resource scheduler is running "well": 1) the resource allocation rate is the proportion of the total cluster resources that the scheduler has allocated to jobs; 2) resource availability is the proportion of resource requests (from a job) that can be fulfilled within a given period of time. Naturally, cloud vendors aim for a high resource allocation rate (directly leading to lower operation cost) and high resource availability (directly leading to better customer experience). However, it is intuitively difficult to maximize these two metrics simultaneously; the right trade-off between them depends on the application scenario.

4.1.2 Eigen Resource Scheduler. We propose a resource scheduling strategy that improves the resource allocation rate without hurting resource availability. We adopt a cascading resource flow model that divides the nodes in a cluster into three types: non-empty online nodes, empty online nodes, and offline nodes. To simultaneously optimize resource allocation rate and resource availability, we present Eigen [28], an end-to-end resource optimizer that optimizes the resources across the whole resource flow. We divide the machines of each node pool into three layers: an online layer (green box), a warm layer (orange box), and a cold layer (blue box). Figure 6 depicts the resource flows that show how machines move between layers and node pools. We describe the optimization problem at each layer as follows:

Figure 6: Eigen's hierarchical resource management system. (The figure shows control and resource flows from the Eigen Master across node pools in a data center (e.g., Node Pool A: MySQL, Node Pool B: Redis, Node Pool C: PolarDB). Each pool has an online layer using Vectorized Resource Optimization, a warm layer managed by a Node Pool Auto-scaler running ES with Smoothed Adaptive Margins, and a cold layer driven by a Temporal CNN and a Minimum-stock Policy, backed by suppliers.)

The online layer consists of non-empty machines (online machines). In this layer, the optimization problem is to pack database instances with heterogeneous resource requests onto as few online machines as possible. We design Vectorized Resource Optimization (VRO), which consists of an online version and an offline version. The online version of VRO is implemented in the Scheduler to schedule resource allocations for online requests; to support regional scheduling on large-scale clusters of up to 100K machines, the scheduling computation must execute as efficiently as possible. The offline version of VRO is implemented in the Rebalancer, which periodically rebalances the cluster through pod moves (i.e., migrations of database instances). In addition to consolidating clusters, the offline version of VRO focuses on reducing the number of migrations to lower the rebalancing cost.

The warm layer consists of empty machines (warm machines), which work as "buffers" to support high resource availability. The optimization problem in this layer is to estimate the minimum amount of resources that will not cause delayed requests over short time periods (e.g., ten seconds, one minute, ten minutes). We design Exponential Smoothing (ES) with smoothed adaptive margins, a resource reservation algorithm that predicts short-term resource usage with a smoothed adaptive margin. It is implemented in the Node Pool Auto-scaler and automatically scales warm machines up and down.

The cold layer consists of offline machines (cold machines). In this layer, the optimization problem is to estimate the minimum number of cold machines that will not cause failed requests over a long time period (e.g., one week, three weeks, one month). We train and deploy probabilistic time-series forecasting models on the Node Pool Auto-scaler to predict long-term daily resource consumption (i.e., the daily difference of allocated resources). Based on the predictions, we design a Minimum-stock Policy that suggests adding or removing cold machines.
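As an illustration of the warm-layer reservation idea, the sketch below exponentially smooths recent demand and adds a margin derived from the smoothed deviation. The exact formula is our own stand-in, not the production Eigen algorithm.

```python
# A sketch of warm-layer capacity estimation in the spirit of "ES with
# smoothed adaptive margins": smooth recent demand, then reserve the
# forecast plus an adaptive safety margin.

def warm_capacity(demand, alpha=0.3, beta=0.3, k=2.0):
    """demand: resource requests per short interval; returns reserved capacity."""
    level, dev = demand[0], 0.0
    for x in demand[1:]:
        prev = level
        level = alpha * x + (1 - alpha) * level        # smoothed demand
        dev = beta * abs(x - prev) + (1 - beta) * dev  # smoothed deviation
    return level + k * dev                             # forecast + adaptive margin

# ten-second intervals of requested cores; reserve enough warm machines
print(warm_capacity([8, 10, 9, 14, 30, 12, 11]))
```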
4.2 Co-design of Software and Hardware
To best serve OLTP workloads, cloud-native relational databases such as PolarDB typically employ the row-store model. A viable option for these databases to better serve analytical workloads is to offload data-access-intensive tasks (in particular, table scans) from database nodes to storage nodes. In spite of the simple concept, its practical implementation in the context of cloud-native databases is particularly non-trivial and requires careful co-design of software and hardware. On the one hand, each storage node must be equipped with sufficient data processing power to handle table scan tasks. On the other hand, to maintain the cost effectiveness of cloud-native databases, we cannot significantly (or even modestly) increase the cost of storage nodes. By complementing CPUs with special-purpose hardware (e.g., GPUs and FPGAs), a heterogeneous computing architecture appears to be an appealing option for addressing this dilemma of data processing power vs. cost. Under this framework, each data storage device becomes a computational storage drive that can carry out table scans on the I/O path. However, a practically viable implementation must address two challenges:

• Support pushdown across the entire software hierarchy: Table scan pushdown is initiated by the PolarDB storage engine, which accesses data by specifying offsets in files, while the table scan is physically served by the computational storage drive, which operates as a raw block device and manages data with LBAs (logical block addresses). The entire storage I/O stack sits between the PolarDB storage engine and the computational storage drive. Hence, we have to cohesively enhance/modify the entire software/driver stack in order to create a path for table scan pushdown.

• Implement a low-cost computational storage drive: Although the FPGA-based design approach can significantly reduce the development cost, FPGAs tend to be expensive. Moreover, since an FPGA typically operates at only 200-300MHz (in contrast to 2-4GHz CPU clock frequencies), we have to employ a large degree of circuit-level parallelism (hence more silicon resources) to achieve sufficiently high performance. Therefore, we must develop solutions that enable the use of a low-cost FPGA chip in our implementation.

To address these two challenges, PolarDB adopts a set of software/hardware techniques [10]. To reduce the product development cycle and meanwhile ensure cost effectiveness, the computational storage drives use an FPGA-centric, host-managed architecture: inside each computational storage drive, a single low-cost Xilinx FPGA chip handles both flash memory control and table scans. With a highly optimized software and hardware design, each computational storage drive can support high-throughput (i.e., over 2GB/s) table scans on compressed data while achieving storage I/O performance comparable to leading-edge NVMe SSDs.

In addition to offloading scan operations to the computational storage devices, PolarDB further leverages smart SSDs to perform transparent data compression. The compression is performed by the FPGA (or ASIC) colocated with the SSDs, delivering over 60% reduction in storage (compared to uncompressed data) with no negative impact on throughput and little increase in the cost of the SSDs. Co-design of software and hardware extends beyond storage devices; for example, the emergence of new technologies such as software-defined networking (SDN) and Compute Express Link (CXL) opens new directions that explore in-network computation and memory pooling and sharing.
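The contract behind table-scan pushdown can be illustrated with a small sketch: the storage engine ships a predicate alongside the extents to read, and only qualifying rows cross the bus. The interface below is hypothetical and simulates in Python what the FPGA evaluates on the I/O path.

```python
# A sketch of the pushdown contract between a storage engine and a
# (simulated) computational storage drive. Interfaces are illustrative.

from dataclasses import dataclass

@dataclass
class ScanRequest:
    extents: list          # (offset, length) pairs, resolved down to LBAs
    column: int            # column to test
    op: str                # e.g. ">"
    operand: int

class ComputationalDrive:
    def __init__(self, rows):
        self.rows = rows                   # stands in for on-flash data

    def pushdown_scan(self, req: ScanRequest):
        test = {">": lambda a, b: a > b, "=": lambda a, b: a == b}[req.op]
        # filtering happens "inside the drive": only matches cross the bus
        return [r for r in self.rows if test(r[req.column], req.operand)]

drive = ComputationalDrive(rows=[(1, 50), (2, 900), (3, 700)])
req = ScanRequest(extents=[(0, 4096)], column=1, op=">", operand=600)
print(drive.pushdown_scan(req))   # [(2, 900), (3, 700)]
```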

5 CO-DESIGN FOR DATA FABRIC

Another trend that we have witnessed is that the lines between different types of database engines are starting to blur: there is a growing need for a database to provide sufficient support for a wide variety of workloads (notably, both transactional processing and analytical processing), especially in the fields of business intelligence [44], social media [7, 31], fraud detection [9], and marketing [19, 52]. To provide such capability, traditional solutions often deploy data and application logic onto multiple databases, each specialized in processing a different type of query, and rely on data synchronization techniques (such as ETL) for consistency among them. Such solutions are costly: they negatively impact OLTP performance and introduce a time-consuming data synchronization process, which further leads to delays or even inconsistencies between the data maintained at the TP and AP databases. Additionally, users are often given isolated portals or interfaces for accessing each of the different databases. In practice, these issues lead to a sub-optimal user experience.

The concept of data fabric has emerged in recent years, promoting the easy sharing and smooth flow of data between different data processing engines and even different eco-systems. In this section, we discuss the designs that enable a unified database interface through meta-data sharing among engines (Section 5.1) and allow smooth data flow between different database engines through zero-ETL and cloud-native HTAP (Section 5.2).

5.1 Unified Database Interface
The wide variety of database engines gives customers many choices for their data processing needs; at the same time, however, it poses challenges to the customers: they need to make "wise" decisions about which engine fits their needs best and how the data should be stored (e.g., within the same engine or across multiple engines). This calls for a unified database interface to simplify the process and alleviate the burden.

• G#1: Transparent Query Execution. To serve mixed workloads in a single database, database users should not be required to understand the working logic of the database, nor should they have to identify query types manually. That is, users should not perceive multiple isolated systems (e.g., engines, indexes, interfaces, etc.) for workloads with different characteristics.

Achieving this requires cross-engine sharing of metadata, so that the database system can make decisions from a holistic view. We adopt a unified management of metadata, including table schemas and statistics, data lineage, database topology, and execution history. Based on this information, the database system's optimization and fabric layer can generate and coordinate, through a rule-based or cost-based process, executions across different engines.

5.2 Data Flow between DB Engines
With unified execution, the data engine responsible for handling writes (e.g., insertions, deletions, and updates) might not be the one best fit for reads. In particular, OLTP engines are often optimized for write performance, while OLAP engines are optimized for processing complex analytical queries. The data flow between DB engines should ideally satisfy the following properties:

• G#2: Advanced OLAP Performance. As a major goal of any HTAP database, its OLAP performance (e.g., execution latency) should be comparable to typical databases specialized in processing OLAP queries (typically through the introduction of columnar data storage).

• G#3: Minimal Perturbation of OLTP Workloads. While the performance of OLAP queries is significantly improved, there should be minimal negative impact on the performance of OLTP queries. In fact, as we have validated in real application scenarios, OLTP queries are usually more mission-critical and more sensitive to performance degradation. This requires effective resource isolation between OLTP and OLAP queries.

• G#4: High Data Freshness. High data freshness is an important property of HTAP databases and a distinguishing advantage compared to the traditional Extract-Transform-Load (ETL) method. Conventionally, visibility delay is used as a freshness score for a query: the visibility delay is the time interval after which updates to the database become visible to OLAP queries.

Figure 7: Architecture of HTAP databases. (Left: an OLTP DBMS (e.g., PolarDB) synchronized to an OLAP DBMS (e.g., ADB) via zero-ETL; right: an HTAP DBMS (PolarDB-IMCI) synchronized internally through REDO logs.)

Figure 7 summarizes two approaches designed with these goals in mind. The system can adopt a zero-ETL approach (shown on the left), where the data synchronization, typically using logical logs, is co-designed with the source and destination engines to achieve goals G#2-G#4. Alternatively, one can adopt a cloud-native HTAP approach (shown on the right), where the synchronization is done through the shared storage using physical logs, and an in-memory column index (IMCI) is built alongside the row-oriented data to facilitate the processing of analytical queries. The zero-ETL approach is more general, while the IMCI approach allows for minimal visibility delay (i.e., the highest data freshness).

5.2.1 Zero-ETL between OLTP and OLAP. The key design of the zero-ETL approach focuses on reducing both the resources and the time needed for transferring data from OLTP to OLAP, and crucially on addressing the discrepancy between the write performance of OLTP and OLAP engines. We adopt a Delta + Main design: for general stream-writing scenarios, data is written to a write-optimized store (i.e., the delta store), typically in the form of an LSM-tree. The goal is to fully utilize the capabilities of row storage and make up for the write-performance shortcomings of columnar storage. A compaction process running in the background then automatically sorts, merges, and eventually writes the data to a read-optimized store (i.e., the main store). Data is first sorted by a predefined sort key; once sorted, the data is easier to prune during scanning, reducing I/O overhead and improving scan performance. The database then merges the write-friendly delta data into read-friendly columnar storage and promptly reclaims the storage space.
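A minimal sketch of this Delta + Main layout is shown below: writes append to a delta store, and a background compaction sorts and merges them into the sorted main store. The structures are simplified stand-ins for the LSM-tree and the columnar main store described above.

```python
# A sketch of the Delta + Main design: fast row-friendly ingest plus a
# background compaction into a sorted, scan-friendly main store.

class DeltaMainStore:
    def __init__(self, sort_key):
        self.sort_key = sort_key
        self.delta = []                   # write-optimized (append-only)
        self.main = []                    # read-optimized (sorted by sort_key)

    def write(self, row):                 # fast path: no sorting on ingest
        self.delta.append(row)

    def compact(self):                    # background: sort, merge, reclaim
        self.main = sorted(self.main + self.delta, key=self.sort_key)
        self.delta = []

    def scan(self, lo, hi, key):
        # the sorted main store allows pruning; the small delta is scanned as-is
        return [r for r in self.main + self.delta if lo <= key(r) <= hi]

store = DeltaMainStore(sort_key=lambda r: r["ts"])
store.write({"ts": 5, "v": "b"})
store.write({"ts": 1, "v": "a"})
store.compact()
print(store.scan(0, 3, key=lambda r: r["ts"]))   # [{'ts': 1, 'v': 'a'}]
```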

5.2.2 Cloud-native HTAP. Figure 8 shows the architecture of PolarDB-IMCI, a cloud-native HTAP database designed and operated by Alibaba Cloud [45]. It adopts PolarFS [12] as its storage layer, plus a computation layer that contains multiple computation nodes: a primary node for read/write requests (the RW node), several nodes for read-only requests (RO nodes), and several stateless proxy nodes for load balancing.

Figure 8: Architecture of PolarDB-IMCI. (A database proxy routes SQL queries to a read/write primary server and scale-out read-only replicas; each node runs a parser/optimizer/executor over InnoDB, with the replicas additionally hosting an IMCI engine and a REDO applier, all on top of shared storage (PolarFS) holding data chunks.)

To speed up analytical queries, PolarDB-IMCI supports building in-memory column indexes on the row store of the RO nodes. Column indexes store data in insertion order and perform out-of-place writes for efficient updates. The insertion order means that a row in a column index can be quickly located by its Row-ID (RID) rather than by its primary key (PK). To support PK-based point lookups, PolarDB-IMCI implements a RID locator (i.e., a two-layer LSM tree) for PK-RID mapping.

PolarDB-IMCI uses an asynchronous replication framework for synchronization between RO and RW. That is, updates to RO nodes are not included in the transaction commit path of the RW node, to avoid impacting it. To enhance data freshness on RO nodes, PolarDB-IMCI applies two optimizations to log application: commit-ahead log shipping, and a conflict-free parallel log replay algorithm. RO nodes are synchronized via the REDO logs of the row store, which perturbs OLTP far less than approaches that use logical logs (such as Binlogs). Note that it is nontrivial to apply physical logs to column indexes, as the data formats of the row store and the column index are heterogeneous.

Inside each RO node, PolarDB-IMCI uses two mutually symbiotic execution engines: PolarDB's regular row-based execution engine to serve OLTP queries, and a new column-based batch-mode execution engine for the efficient execution of analytical queries. The batch-mode execution engine draws on the techniques used by columnar databases to handle analytical queries, including a pipelined execution model, parallel operators, and a vectorized expression evaluation framework. The regular row-based execution engine, with augmented optimizations, can take over the queries that are incompatible with the column engine, as well as point queries.
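The RID locator's role can be seen in a few lines: the column index addresses values by insertion position (RID), so point lookups need a PK-to-RID map. The sketch below uses a plain dict where PolarDB-IMCI uses a two-layer LSM tree; names are illustrative.

```python
# A sketch of PK-based point lookups over an insertion-ordered column index.

class ColumnIndex:
    def __init__(self):
        self.columns = {"pk": [], "amount": []}   # data in insertion order
        self.locator = {}                         # PK -> RID (two-layer LSM in IMCI)

    def insert(self, pk, amount):
        rid = len(self.columns["pk"])             # RID = insertion position
        self.columns["pk"].append(pk)
        self.columns["amount"].append(amount)
        self.locator[pk] = rid

    def point_lookup(self, pk):
        rid = self.locator[pk]                    # PK-based lookup via the locator
        return {c: vals[rid] for c, vals in self.columns.items()}

idx = ColumnIndex()
idx.insert(pk=1001, amount=25)
idx.insert(pk=1002, amount=70)
print(idx.point_lookup(1002))    # {'pk': 1002, 'amount': 70}
```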
6 AI AUGMENTED DATABASES

With the rapid progress in artificial intelligence (AI) technologies, PolarDB takes a proactive approach to integrating AI features into database systems, so that they are easier to use, more efficient to operate, and exhibit a certain intelligence in deriving valuable insights from data. To this end, we have put into production a number of AI-augmented functionalities driven by customer needs. The two most relevant features are an enhanced natural language interface that can convert questions into SQL statements [20], and a diagnosis system that pinpoints the root causes of performance issues [22, 29]. These new features enhance database systems by automating data management processes, improving user experience, and enabling intelligent decision-making in DevOps.

Figure 9: AI-augmented databases.

6.1 NL2SQL
Natural language to SQL (NL2SQL) techniques provide a convenient interface for accessing databases, especially for non-expert users conducting various data analytics. With prior knowledge of the database schema, PolarDB can automatically translate a natural language question into the corresponding SQL query. Providing a high-quality translation poses three challenges: (1) which tables and columns should be used in the query; (2) what is the correct query structure; and (3) how to fill in the query details and literals.

Recently, research on natural language models has made significant breakthroughs. Deep models based on transformer architectures have demonstrated great performance due to their excellent in-context learning capabilities. Interestingly, SQL languages are traditionally studied in the framework of context-free grammars. The former captures complex meanings and generalizes well, but may produce queries with syntactic or semantic errors; the latter can define high-level structures with rigorous and accurate grammars, but may miss the global picture of the whole sentence. We introduce the NL2SQL module for PolarDB by bridging the gap between the two, designing a new framework that significantly improves both accuracy and runtime. In particular, we develop a novel sketch [20], which constructs a template with slots that initially serve as placeholders, and tightly integrates it with a deep learning model that fills in these slots with meaningful contents based on the database schema.

Compared with the widely used sequence-to-sequence approaches, our sketch-based method does not need to generate the keywords that are boilerplate in the template, achieves better accuracy, and runs much faster. Compared with existing sketch-based approaches, our method is more general and versatile, and it can leverage the values already filled into certain slots to derive the rest, for improved performance.

In addition, we also leverage database domain knowledge by introducing a post-processing module. It checks the initially generated SQL queries, applying rules to identify and correct semantic errors. This technique significantly improves NL2SQL accuracy. Extensive evaluations on both single-domain and cross-domain benchmarks demonstrate that our approach achieves high accuracy and throughput.
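The core of the sketch-based idea can be conveyed with a toy example: a template whose typed slots are filled from model predictions and then checked against the schema by post-processing rules. This is our simplification for illustration, not the production implementation.

```python
# A toy illustration of sketch-based NL2SQL: fill typed slots in a template,
# then validate the filled values against the database schema.

TEMPLATE = "SELECT {agg}({col}) FROM {table} WHERE {filter_col} {op} {value}"

SCHEMA = {"orders": ["id", "amount", "city"]}     # hypothetical schema

def fill_sketch(slots):
    # post-processing rule: slot values must exist in the schema
    if slots["table"] not in SCHEMA:
        raise ValueError("unknown table: " + slots["table"])
    for s in ("col", "filter_col"):
        if slots[s] not in SCHEMA[slots["table"]]:
            raise ValueError("unknown column: " + slots[s])
    return TEMPLATE.format(**slots)

# slot values would come from the deep model reading the NL question,
# e.g. "What is the average order amount in Hangzhou?"
print(fill_sketch({"agg": "AVG", "col": "amount", "table": "orders",
                   "filter_col": "city", "op": "=", "value": "'Hangzhou'"}))
```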
6.2 In-DB AI
In-DB AI aims to eliminate unnecessary data movement between databases and AI computation engines. It can shorten the development cycle and lower operational costs by integrating AI capabilities natively through SQL queries. We build an in-database inference framework that generates database-loadable functions from already-developed AI models. After registering an AI model, users can directly write SQL that invokes model inference coherently within the query flow. To flexibly support models of different sizes and platforms, our framework considers three scenarios. First, we provide pre-installed model functions. Second, a user can upload a trained model, from which static objects for shared files inside the database instances are generated automatically; we support models from TensorFlow, PyTorch, and libraries such as XGBoost. Third, for large models, e.g., LLMs (large language models), our framework generates a hook function that calls an external model inference service. In all of the above scenarios, users write native SQL statements to augment data queries with AI models.
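To show what SQL-native inference feels like from the user's side, the sketch below registers a stand-in model function as a UDF and invokes it inside a query, using SQLite purely as a convenient demonstration vehicle; the function name ml_predict is hypothetical.

```python
# A sketch of invoking model inference coherently inside a SQL query.
# SQLite's user-defined functions stand in for the in-database framework.

import sqlite3

def ml_predict(amount):
    """Stand-in for a registered model function (e.g., a loaded XGBoost model)."""
    return 1 if amount > 500 else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (2, 980.0)])
conn.create_function("ml_predict", 1, ml_predict)   # register model as a UDF

# inference invoked directly in the query flow; data never leaves the DB
rows = conn.execute("SELECT id, ml_predict(amount) AS flagged FROM orders").fetchall()
print(rows)   # [(1, 0), (2, 1)]
```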
6.3 AIOps for Cloud-Native Database Systems
The PolarDB architecture follows a decoupled design principle: its individual components, like those of any other distributed system, may contain inevitable faults or failures, and they are organized together through highly efficient message passing over RDMA networks. To make the system robust and easier to manage, we design a new root cause diagnosis facility for its DevOps system [22, 29]. It is based on the observation that anomalies, or variations in general, require a quantitative characterization of the influence of individual components on the end-to-end performance metrics.

On a causal graph that represents the complex dependencies between the system components, scattered detected anomalies, even when they look similar, can have different implications with contrasting remedial actions. Though various heuristic methods have been successfully applied to certain cases in practice, the complexity constantly poses challenges for pinpointing the root causes, as the heterogeneous components are connected on a (non-strict) causal/relationship graph. Most existing works focus on predicting the performance issues (or, more rigorously, counterfactuals) when a given set of components/factors changes to a hypothetical operating state. Our work complements them by quantifying the influence of variations in the different factors using the Shapley value, a concept from game theory for fairly allocating the influence of individual factors. A factor's influence is measured by the average performance difference before and after adding this factor to a random set of existing factors; for example, one can study the difference in end-to-end latency caused by adding a high-CPU-usage issue. Our system can automatically drill down to the combinations of attributes where anomalies occur and evaluate the impact of each attribute value. This system has been successfully deployed, with more than 10,000 operations for 86 services on 2,546 machines. It has significantly improved DevOps efficiency and greatly reduced system failure rates.
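The Shapley computation just described can be sketched with the standard permutation-sampling estimator: average a factor's marginal effect on the metric over random orderings of the factors. The latency model below is invented purely for illustration.

```python
# A sketch of Shapley-value factor attribution via permutation sampling.

import random

FACTORS = ["high_cpu", "net_congestion", "lock_contention"]

def latency(active):
    """Hypothetical end-to-end latency (ms) under a set of active issues."""
    base = 10.0
    if "high_cpu" in active:        base += 30.0
    if "net_congestion" in active:  base += 15.0
    if "lock_contention" in active: base += 5.0
    if {"high_cpu", "net_congestion"} <= set(active):
        base += 20.0                # interaction effect between two issues
    return base

def shapley(factor, samples=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        order = FACTORS[:]
        rng.shuffle(order)                      # random arrival order of factors
        before = order[:order.index(factor)]    # coalition preceding this factor
        total += latency(before + [factor]) - latency(before)
    return total / samples                      # average marginal contribution

for f in FACTORS:
    print(f, round(shapley(f), 1))   # high_cpu receives the largest influence
```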
7 OUTLOOK AND FUTURE TRENDS

7.1 LLM and Vector DB
Large language models (LLMs) and vector databases, combined together, bring exciting new opportunities to various fields, thanks to the progress in natural language processing and information retrieval. These technologies have significantly improved the way we interact with computers and access information.

Large language models, such as ChatGPT and Alibaba Tongyi, are capable of generating human-like text and understanding the context and meaning of words and sentences. They can be used for a wide range of applications, including language translation, text summarization, and creative writing. Vector databases store and retrieve data by representing them as vectors, and can compare and search for similar items based on their inherent characteristics. The combination of large language models and vector databases enables more accurate and natural text generation grounded in domain knowledge, as well as efficient and personalized data retrieval.

Vector databases provide new dimensions in forming queries that enable more intelligence. In certain applications, both unstructured and structured data must be queried jointly. To address the challenge posed by such hybrid queries, we design and implement AnalyticDB-V (ADBV) [47], which manages feature vectors and structured data cooperatively. To improve the accuracy of querying massive data in vector form, we design an index based on Voronoi Graph Product Quantization, which efficiently narrows down the search scope for fast indexing. In addition, it is wrapped as physical operators that the query optimizer can use to process hybrid queries efficiently and effectively.
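A hybrid query of this kind, a structured predicate combined with nearest-neighbor search over feature vectors, can be sketched as follows. We brute-force the distance computation; ADBV would instead use its VGPQ index to narrow the search scope.

```python
# A sketch of a hybrid query: structured filter + vector nearest-neighbor.

import math

ITEMS = [  # (id, city, feature vector from some embedding model)
    (1, "Hangzhou", [0.1, 0.9]),
    (2, "Beijing",  [0.8, 0.1]),
    (3, "Hangzhou", [0.2, 0.8]),
]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_query(city, query_vec, k=1):
    candidates = [it for it in ITEMS if it[1] == city]        # structured filter
    return sorted(candidates, key=lambda it: l2(it[2], query_vec))[:k]  # ANN part

print(hybrid_query("Hangzhou", [0.15, 0.85]))   # -> nearest Hangzhou item
```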

7.2 Data Security
7.2.1 Background. During the past decade, practical techniques for database security have not witnessed significant advances: access control, file encryption, and database auditing have long become de-facto standards [16, 17] that protect the database from unexpected accesses and external attacks. In conventional settings, a database system runs in a private domain, and the system owners (as well as administrators) inherently have full access to the data inside. However, recent trends have overturned this assumption and brought new security issues. The first trend is cloud computing: cloud systems, as outsourced infrastructure, break the private-domain assumption, and hence they might be compromised by insiders (e.g., co-tenants or rogue staff). The second trend is that the data-centric revolution has complicated data management in applications. More specifically, data needs to flow between different processing components, each of which may be controlled by a different entity (e.g., internal subdivisions, business partners, and independent software vendors). Once the data flows into others' subsystems and databases, it is no longer under the control of the original data owner, leading to a contradiction between the utilization and the ownership of data. Data ownership here involves many aspects of data security, such as the confidentiality of users' sensitive data and the authenticity of users' query results.

7.2.2 Encrypted database. Many encrypted database systems have been built by both academia and industry to protect the confidentiality of sensitive data in outsourced databases. They either exploit special cryptographic primitives [14, 21, 33] to support data manipulation directly over ciphertext [35, 37, 40], or use trusted execution environments (TEEs) [18, 24] to operate on sensitive data in an isolated enclave inaccessible from the rest of the host [3, 5, 38, 39, 41]. However, existing solutions assume that the authorized endpoint directly interacting with the encrypted database is trusted and can touch sensitive data, which is hard to achieve in practice. Hence, we propose a new paradigm for the encrypted database, namely the ownership-preserving database (OPDB) [46], in which the data is never revealed to any subsystem, and hence the data owner exclusively governs data accessibility. In a nutshell, all sensitive data remains in ciphertext wherever it appears (e.g., in the memory of application/database server processes or on disk), and only the data owner can decrypt it. When an entity (e.g., a business partner) needs to process or utilize the sensitive data in its business logic, the data owner only needs to grant it access to the necessary operations that are adequate to complete the task, using the cipher processing capability of an OPDB instance. With OPDB, sensitive data can be securely passed and processed across different entities' subsystems and databases, which significantly reduces the risk of data abuse and leakage throughout the entire process.

Figure 10: Illustration of supporting the OPDB with Operon. Data owners use their keys (differently colored) to protect data against any application processes or databases. It only performs permitted operations (i.e., by BCL) within a TEE.

Following the OPDB paradigm, we build Operon [46] as its first implementation, utilizing TEEs to re-establish the private-domain assumption. Operon introduces a behavior control list (BCL), which extends conventional system-oriented resource access control with control over data operation behaviors, decoupling data ownership from system ownership. Operon is the first database framework in which the data owner exclusively controls the accessibility and behaviors of sensitive data, even after the data has passed through many entities' untrusted subsystems. Figure 10 shows an example of supporting OPDB with Operon. It adopts a modular architecture and acts as a feature enhancement to existing database systems. We have successfully integrated it with different TEEs (i.e., Intel SGX [18] and an FPGA-based implementation), as well as with various database services on Alibaba Cloud (i.e., PolarDB and RDS). In addition, since real-world applications always involve rich database functionality (e.g., mixed-type expressions, connection pools, client drivers) beyond basic operational primitives, we must also provide the corresponding functionality while preserving data ownership; Operon uses a set of server-side and client-side co-designs to solve this problem.

7.2.3 Verifiable database. Currently, data integrity largely relies on the customer's trust that the cloud service provider has faithfully maintained the data (and the computed results) outsourced to it. It remains challenging for the customer to verify whether the data or computation results retrieved from the cloud are correct; from the cloud service provider's perspective, the capability of proving to the client that its data is correctly handled is also a highly desirable security feature, encouraging hesitant clients to adopt cloud-centric solutions. Verifiability is therefore a sufficiently strong guarantee in practice: the correctness of any returned result is (cryptographically) verifiable. It allows the client to detect any fault with non-repudiable evidence, and allows the cloud service provider to retain a formal proof of its correct operation. There has already been a large body of work on providing verifiability for cloud databases [4, 6, 27, 34, 36, 51]. Notably, a classic approach leverages cryptographic primitives to verify the query results of a specific query [30] or of arbitrary SQL [50, 51], but with either limited capability or poor performance.

The emergence of TEEs provides a new avenue towards verifiable databases. Such trusted hardware acts as an additional trust anchor, allowing great simplification and, in turn, performance improvements [4]. Consequently, we build VeriDB [53], an SGX-based verifiable database that supports relational tables, multiple access methods, and general SQL queries. In VeriDB, the client interacts with a query engine that resides in an SGX enclave. Hence, the returned query result can be trusted and easily verified (by checking whether it is endorsed by the SGX), as long as the inputs to the query engine (i.e., the data retrieved from storage) are correct. This effectively reduces the problem of verifying query correctness to that of verifying the integrity of data retrieval from storage.
cipher processing capability in an OPDB instance. With OPDB,
sensitive data can be securely passed and processed across different 8 CONCLUSION
entities’ subsystems and databases, which significantly reduces the Cloud databases have evolved significantly, yet many more impor-
risk of data abuses and leakages throughout the entire process. tant and interesting features are still to be explored and developed.
Following the OPDB paradigm, we build Operon [46] as its first implementation, utilizing TEEs to re-establish the private-domain assumption. Operon introduces a behavior control list (BCL), which extends conventional system-oriented resource access control with control over data operation behaviors, thereby decoupling data ownership from system ownership. Operon is the first database framework with which the data owner exclusively controls the accessibility and behaviors of sensitive data, even after the data has passed through many entities' untrusted subsystems.
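To convey the flavor of a BCL, the minimal sketch below models a policy that grants operations on data rather than access to system resources. The policy layout, field names, and enforcement hook are our illustrative assumptions, not Operon's actual interface; in Operon, such checks are enforced inside the TEE, next to the keys.

```python
from dataclasses import dataclass, field

@dataclass
class BCL:
    # (grantee, column) -> set of permitted operations on the (encrypted) data.
    rules: dict = field(default_factory=dict)

    def grant(self, grantee: str, column: str, ops: set):
        self.rules.setdefault((grantee, column), set()).update(ops)

    def permits(self, grantee: str, column: str, op: str) -> bool:
        return op in self.rules.get((grantee, column), set())

bcl = BCL()
# The data owner allows a partner to aggregate salaries, never to decrypt them.
bcl.grant("partner_app", "employee.salary", {"SUM", "AVG"})

def run_op(grantee: str, column: str, op: str) -> str:
    # Unlike a classic ACL, the decision concerns the behavior on the data
    # itself, so it holds even when the grantee owns and operates the database.
    if not bcl.permits(grantee, column, op):
        raise PermissionError(f"{op} on {column} not granted to {grantee}")
    return f"executing {op} over ciphertext of {column}"

print(run_op("partner_app", "employee.salary", "SUM"))  # permitted
try:
    run_op("partner_app", "employee.salary", "DECRYPT")
except PermissionError as e:
    print(e)                                            # denied: ownership kept
```

The key design point is that the rule set travels with the data owner's keys rather than with any particular system, which is what decouples data ownership from system ownership.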
Figure 10 shows an example of supporting OPDB with Operon. It adopts a modular architecture and acts as a feature enhancement to existing database systems. We have successfully integrated it with different TEEs (i.e., Intel SGX [18] and an FPGA-based implementation), as well as with various database services on Alibaba Cloud (i.e., PolarDB and RDS). In addition, since real-world applications always involve data flows across multiple entities' subsystems, such end-to-end ownership-preserving protection is essential in practice.

7.2.3 Verifiable database. Any cloud-centric solution ultimately relies on the customer's trust that the cloud service provider has faithfully maintained the data (and computed results) outsourced to them. It remains challenging for the customer to verify whether the data or computation results retrieved from the cloud are correct; from the cloud service provider's perspective, the capability of proving to the client that its data is correctly handled is also a highly desirable security feature, one that encourages hesitant clients to adopt cloud-centric solutions. Therefore, verifiability is a sufficiently strong guarantee in practice: the correctness of any returned result is (cryptographically) verifiable. It allows the client to detect any fault with non-repudiable evidence, and allows the cloud service provider to retain a formal proof of its correct operations. There is already a large body of work on providing verifiability for cloud databases [4, 6, 27, 34, 36, 51]. Notably, a classic approach leverages cryptographic primitives to verify the result of a specific query [30] or of an arbitrary SQL query [50, 51], but with either limited capability or poor performance.
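To make the classic approach concrete, the sketch below builds a Merkle hash tree [30] over a table's rows and checks a membership proof against a trusted root digest. The helper functions are our own minimal rendition; systems such as IntegriDB [51] layer much richer authenticated structures (e.g., for joins and aggregates) on top of this primitive.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build_tree(leaves):
    # Bottom-up construction; each level halves the number of nodes.
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:                      # duplicate last node on odd levels
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def prove(levels, idx):
    # Collect the sibling hash at every level, remembering which side it is on.
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = idx ^ 1
        proof.append((level[sib], "left" if sib < idx else "right"))
        idx //= 2
    return proof

def verify(root, record, proof):
    acc = h(record)
    for sib, side in proof:
        acc = h(sib + acc) if side == "left" else h(acc + sib)
    return acc == root

records = [b"row1", b"row2", b"row3", b"row4", b"row5"]
levels = build_tree(records)
root = levels[-1][0]                          # the owner keeps only this digest

proof = prove(levels, 2)
assert verify(root, b"row3", proof)           # authentic record passes
assert not verify(root, b"row3-forged", proof)  # tampered record fails
```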
The emergence of TEEs provides a new avenue towards verifiable databases. Such trusted hardware acts as an additional trust anchor, allowing great simplification and, in turn, performance improvements [4]. In consequence, we build VeriDB [53], an SGX-based verifiable database that supports relational tables, multiple access methods, and general SQL queries. In VeriDB, the client interacts with a query engine that resides in an SGX enclave. Hence, the returned query result can be trusted and easily verified (by checking whether it is endorsed by the SGX), as long as the inputs to the query engine (i.e., the data retrieved from the storage) are correct. This effectively reduces the problem of verifying query correctness to that of verifying the integrity of data retrieval from the storage.
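In spirit, the client-side check in such a design reduces to two steps: confirm that the result is endorsed by the enclave, and confirm that the storage the enclave read matches a trusted digest. The sketch below mimics this flow, with a shared MAC key standing in for an SGX-attested signing key; all structures and names here are illustrative assumptions, not VeriDB's actual protocol.

```python
import hashlib
import hmac
import json

# A shared MAC key stands in for the enclave's attested signing key; with
# SGX, the client would instead verify a signature rooted in remote
# attestation. All formats below are illustrative.
ENCLAVE_KEY = b"provisioned-via-remote-attestation"

def enclave_execute(query, storage, storage_digest):
    # Inside the enclave: check the integrity of what was read from storage,
    # run the query, and endorse (query, result) together.
    assert hashlib.sha256(json.dumps(storage).encode()).digest() == storage_digest
    result = [row for row in storage if row["balance"] >= 100]   # toy "query"
    payload = json.dumps({"query": query, "result": result}).encode()
    tag = hmac.new(ENCLAVE_KEY, payload, hashlib.sha256).digest()
    return result, tag

def client_verify(query, result, tag):
    payload = json.dumps({"query": query, "result": result}).encode()
    expected = hmac.new(ENCLAVE_KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

storage = [{"id": 1, "balance": 250}, {"id": 2, "balance": 40}]
digest = hashlib.sha256(json.dumps(storage).encode()).digest()

result, tag = enclave_execute("balance >= 100", storage, digest)
assert client_verify("balance >= 100", result, tag)   # endorsed result passes
assert not client_verify("balance >= 100", [], tag)   # altered result fails
```

Once the enclave's endorsement is trusted, the only remaining obligation is the storage-integrity check, which is exactly where Merkle-style structures like the one sketched above come in.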
8 CONCLUSION
Cloud databases have evolved significantly, yet many more important and interesting features remain to be explored and developed. With the rapid development of cloud infrastructures and AI technologies, the deep and seamless integration of cloud databases with both the cloud platform and AI techniques will only intensify. By fully leveraging the benefits offered by the underlying cloud infrastructure and advanced AI techniques, we may soon expect running cloud databases to be just like playing with legos.

ACKNOWLEDGMENTS
This paper is a summary of the collective work done by Alibaba Cloud's ApsaraDB database team, in particular, the PolarDB team. The author is grateful to the entire database team, and in particular, to Wei Cao, Zongzhi Chen, Qingda Hu, Gui Huang, Jian Tan, Jianying Wang, Sheng Wang, Jimmy Xingjun Yang, Guangzhou Zhang, Yingqiang Zhang, Wenchao Zhou, and many fellow team members.

REFERENCES
[1] [n.d.]. Working with Aurora Multi-master Clusters. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html.
[2] Phillipe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. 2015. Challenges to Adopting Stronger Consistency at Scale. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV).
[3] Panagiotis Antonopoulos, Arvind Arasu, Kunal D. Singh, Ken Eguro, Nitish Gupta, Rajat Jain, Raghav Kaushik, Hanuma Kodavalla, Donald Kossmann, Nikolas Ogg, Ravi Ramamurthy, Jakub Szymaszek, Jeffrey Trimmer, Kapil Vaswani, Ramarathnam Venkatesan, and Mike Zwilling. 2020. Azure SQL Database Always Encrypted. In Proceedings of the 2020 International Conference on Management of Data (Portland, OR, USA) (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 1511–1525. https://doi.org/10.1145/3318464.3386141
[4] Arvind Arasu, Ken Eguro, Raghav Kaushik, Donald Kossmann, Pingfan Meng, Vineet Pandey, and Ravi Ramamurthy. 2017. Concerto: A High Concurrency Key-Value Store with Integrity. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017. ACM, 251–266.
[5] Maurice Bailleu, Jörg Thalheim, Pramod Bhatotia, Christof Fetzer, Michio Honda, and Kapil Vaswani. 2019. Speicher: Securing LSM-Based Key-Value Stores Using Shielded Execution. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (Boston, MA, USA) (FAST '19). USENIX Association, USA, 173–190.
[6] Sumeet Bajaj and Radu Sion. 2013. CorrectDB: SQL Engine with Practical Query Authentication. Proc. VLDB Endow. 6, 7 (2013), 529–540.
[7] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind Menon, Samuel Rash, Rodrigo Schmidt, and Amitanand Aiyer. 2011. Apache Hadoop Goes Realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). Association for Computing Machinery, New York, NY, USA, 1071–1080.
[8] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, et al. 2013. TAO: Facebook's Distributed Data Store for the Social Graph. In 2013 USENIX Annual Technical Conference (USENIX ATC 13). 49–60.
[9] Shaosheng Cao, XinXing Yang, Cen Chen, Jun Zhou, Xiaolong Li, and Yuan Qi. 2019. TitAnt: Online Real-Time Transaction Fraud Detection in Ant Financial. Proc. VLDB Endow. 12, 12 (Aug. 2019), 2082–2093.
[10] Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, Zhenjun Liu, Feng Zhu, and Tong Zhang. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In 18th USENIX Conference on File and Storage Technologies (FAST 20). USENIX Association, Santa Clara, CA, 29–41.
[11] Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: An Ultra-Low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1849–1862. https://doi.org/10.14778/3229863.3229872
[12] Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: An Ultra-Low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database. Proceedings of the VLDB Endowment 11, 12 (2018), 1849–1862.
[13] Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu, Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang, Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao, Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). 2477–2489.
[14] Nathan Chenette, Kevin Lewi, Stephen A. Weis, and David J. Wu. 2016. Practical Order-Revealing Encryption with Limited Leakage. In International Conference on Fast Software Encryption. Springer, 474–493.
[15] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, et al. 2013. Spanner: Google's Globally Distributed Database. ACM Transactions on Computer Systems (TOCS) 31, 3 (2013), 1–22.
[16] Microsoft Corporation. 2022. Row-Level Security. Retrieved March 1, 2022 from https://docs.microsoft.com/en-us/sql/relational-databases/security/row-level-security
[17] Oracle Corporation. 2022. Virtual Private Database. Retrieved March 1, 2022 from https://www.oracle.com/database/technologies/virtual-private-db.html
[18] Victor Costan and Srinivas Devadas. 2016. Intel SGX Explained. IACR Cryptol. ePrint Arch. 2016, 86 (2016), 1–118.
[19] Lei Deng, Jerry Gao, and Chandrasekar Vuppalapati. 2015. Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing. In First IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015, Redwood City, CA, USA, March 30 - April 2, 2015. IEEE Computer Society, 256–266.
[20] Han Fu, Chang Liu, Bin Wu, Feifei Li, Jian Tan, and Jianling Sun. 2023. CatSQL: Towards Real World Natural Language to SQL Applications. Proc. VLDB Endow. 16, 6 (Apr. 2023), 1534–1547.
[21] Craig Gentry. 2009. Fully Homomorphic Encryption Using Ideal Lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (Bethesda, MD, USA) (STOC '09). Association for Computing Machinery, New York, NY, USA, 169–178. https://doi.org/10.1145/1536414.1536440
[22] Xiao He, Ye Li, Jian Tan, Bin Wu, and Feifei Li. 2023. OneShotSTL: One-Shot Seasonal-Trend Decomposition for Online Time Series Anomaly Detection and Forecasting. Proc. VLDB Endow. 16, 6 (Apr. 2023), 1399–1412.
[23] Gui Huang, Xuntao Cheng, Jianying Wang, Yujie Wang, Dengcheng He, Tieying Zhang, Feifei Li, Sheng Wang, Wei Cao, and Qiang Li. 2019. X-Engine: An Optimized Storage Engine for Large-Scale E-Commerce Transaction Processing. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 651–665. https://doi.org/10.1145/3299869.3314041
[24] Advanced Micro Devices Incorporated. 2005. Secure Virtual Machine Architecture Reference Manual.
[25] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. 2021. Data Management in Microservices: State of the Practice, Challenges, and Research Directions. arXiv preprint arXiv:2103.00170 (2021).
[26] Feifei Li. 2019. Cloud-Native Database Systems at Alibaba: Opportunities and Challenges. Proc. VLDB Endow. 12, 12 (Aug. 2019), 2263–2272. https://doi.org/10.14778/3352063.3352141
[27] Feifei Li, Marios Hadjieleftheriou, George Kollios, and Leonid Reyzin. 2006. Dynamic Authenticated Index Structures for Outsourced Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006. ACM, 121–132.
[28] Ji You Li, Jiachi Zhang, Wenchao Zhou, Yuhang Liu, Shuai Zhang, Zhuoming Xue, Ding Xu, Hua Fan, Fangyuan Zhou, and Feifei Li. 2023. Eigen: End-to-end Resource Optimization for Large-Scale Databases on the Cloud. In International Conference on Very Large Data Bases (VLDB).
[29] Ye Li, Jian Tan, Bin Wu, Xiao He, and Feifei Li. 2024. ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices. (2024).
[30] Ralph C. Merkle. 1987. A Digital Signature Based on a Conventional Encryption Function. In Advances in Cryptology - CRYPTO '87, A Conference on the Theory and Applications of Cryptographic Techniques, Santa Barbara, California, USA, August 16-20, 1987, Proceedings (Lecture Notes in Computer Science), Vol. 293. Springer, 369–378.
[31] Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. 2013. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 1147–1158.
[32] Oracle. [n.d.]. InnoDB. https://dev.mysql.com/doc/refman/8.0/en/innodb-storage-engine.html.
[33] Pascal Paillier. 1999. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques (Prague, Czech Republic) (EUROCRYPT '99). Springer-Verlag, Berlin, Heidelberg, 223–238.
[34] HweeHwa Pang, Arpit Jain, Krithi Ramamritham, and Kian-Lee Tan. 2005. Verifying Completeness of Relational Query Results in Data Publishing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005. ACM, 407–418.
[35] Vasilis Pappas, Fernando Krell, Binh Vo, Vladimir Kolesnikov, Tal Malkin, Seung Geol Choi, Wesley George, Angelos Keromytis, and Steve Bellovin. 2014. Blind Seer: A Scalable Private DBMS. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (S&P '14). IEEE Computer Society, USA, 359–374. https://doi.org/10.1109/SP.2014.30
[36] Yanqing Peng, Min Du, Feifei Li, Raymond Cheng, and Dawn Song. 2020. FalconDB: Blockchain-based Collaborative Database. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. ACM, 637–652.
[37] Raluca Ada Popa, Catherine M. S. Redfield, Nickolai Zeldovich, and Hari Balakrishnan. 2011. CryptDB: Protecting Confidentiality with Encrypted Query Processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (Cascais, Portugal) (SOSP '11). Association for Computing Machinery, New York, NY, USA, 85–100. https://doi.org/10.1145/2043556.2043566
[38] Christian Priebe, Kapil Vaswani, and Manuel Costa. 2018. EnclaveDB: A Secure Database Using SGX. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (S&P '18). IEEE Computer Society, USA, 264–278.
[39] Kui Ren, Yu Guo, Jiaqi Li, Xiaohua Jia, Cong Wang, Yajin Zhou, Sheng Wang, Ning Cao, and Feifei Li. 2020. HybrIDX: New Hybrid Index for Volume-hiding Range Queries in Data Outsourcing Services. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 23–33. https://doi.org/10.1109/ICDCS47774.2020.00014
[40] Xuanle Ren, Le Su, Zhen Gu, Sheng Wang, Feifei Li, Yuan Xie, Song Bian, Chao Li, and Fan Zhang. 2023. HEDA: Multi-Attribute Unbounded Aggregation over Homomorphically Encrypted Database. Proc. VLDB Endow. 16, 4 (Dec. 2023), 601–614. https://doi.org/10.14778/3574245.3574248
[41] Yuanyuan Sun, Sheng Wang, Huorong Li, and Feifei Li. 2021. Building Enclave-Native Storage Engines for Practical Encrypted Databases. Proceedings of the VLDB Endowment 14, 6 (Feb. 2021), 1019–1032. https://doi.org/10.14778/3447689.3447705
[42] Mellanox Technologies. [n.d.]. RDMA Aware Networks Programming User Manual. http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf.
[43] The Transaction Processing Council. 2007. TPC-E Benchmark. http://tpc.org/tpce/. Accessed April 2022.
[44] Alejandro Vera-Baquero, Ricardo Colomo-Palacios, and Owen Molloy. 2016. Real-time Business Activity Monitoring and Analysis of Process Performance on Big-Data Domains. Telematics and Informatics 33, 3 (2016), 793–807.
[45] Jianying Wang, Tongliang Li, Haoze Song, Xinjun Yang, Wenchao Zhou, Feifei Li, Baoyue Yan, Qianqian Wu, Yukun Liang, Chengjun Ying, Yujie Wang, Baokai Chen, Chang Cai, Yubin Ruan, Xiaoyi Weng, Shibin Chen, Liang Yin, Chengzhong Yang, Xin Cai, Hongyan Xing, Nanlong Yu, Xiaofei Chen, Dapeng Huang, and Jianling Sun. 2023. PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba. Proc. ACM Manag. Data 1, 2 (2023), 199:1–199:25. https://doi.org/10.1145/3589785
[46] Sheng Wang, Yiran Li, Huorong Li, Feifei Li, Chengjin Tian, Le Su, Yanshan Zhang, Yubing Ma, Lie Yan, Yuanyuan Sun, et al. 2022. Operon: An Encrypted Database for Ownership-Preserving Data Management. Proceedings of the VLDB Endowment 15, 12 (2022), 3332–3345.
[47] Chuangxian Wei, Bin Wu, Sheng Wang, Renjie Lou, Chaoqun Zhan, Feifei Li, and Yuanzhe Cai. 2020. AnalyticDB-V: A Hybrid Analytical Engine towards Query Fusion for Structured and Unstructured Data. Proc. VLDB Endow. 13, 12 (Aug. 2020), 3152–3165.
[48] Wikipedia. 2015. Consistency Model. https://en.wikipedia.org/wiki/Consistency_model.
[49] Xinjun Yang, Yingqiang Zhang, Hao Chen, Chuan Sun, Feifei Li, and Wenchao Zhou. 2023. PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads. Proc. VLDB Endow. 16, 12 (Aug. 2023).
[50] Yupeng Zhang, Daniel Genkin, Jonathan Katz, Dimitrios Papadopoulos, and Charalampos Papamanthou. 2017. vSQL: Verifying Arbitrary SQL Queries over Dynamic Outsourced Databases. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017. IEEE Computer Society, 863–880.
[51] Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou. 2015. IntegriDB: Verifiable SQL for Outsourced Databases. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015. ACM, 1480–1491.
[52] Jun Zhou, Xiaolong Li, Peilin Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, and Yuan Alan Qi. 2017. KunPeng: Parameter Server Based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). Association for Computing Machinery, New York, NY, USA, 1693–1702.
[53] Wenchao Zhou, Yifan Cai, Yanqing Peng, Sheng Wang, Ke Ma, and Feifei Li. 2021. VeriDB: An SGX-Based Verifiable Database. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 2182–2194. https://doi.org/10.1145/3448016.3457308