
Datenbank-Spektrum (2025) 25:17–28

https://doi.org/10.1007/s13222-025-00494-9

SCHWERPUNKTBEITRAG (Focus Article)

Building a Data Management System for the Cloud: Lessons Learned and Future Directions

Martin Hentschel1 · Jonathan Dees2 · Florian Funke2 · Max Heimel2 · Ismail Oukid2

Received: 14 October 2024 / Accepted: 8 January 2025 / Published online: 5 March 2025
© The Author(s) 2025

Abstract
The paper discusses the lessons learned from building Snowflake, a data management system for the cloud. Given the
need for systems that can scale to handle large data volumes, provide expressive programming interfaces, and leverage
the benefits of cloud computing, it describes the architecture of a cloud-based data management system and optimization
techniques specific to the cloud. Key techniques include pruning large file sets at both compile time and query runtime,
optimizing data layouts in the background, and, more generally, the importance of performing maintenance tasks in the
background, which is enabled by cloud resources. The paper also explains the need for using immutable files and the
implications for data modification queries. Finally, it highlights the operational aspects of building and maintaining a data
management system that functions as an online cloud service. The paper concludes by outlining future directions for
cloud-based data management systems.

Keywords Data management · Cloud computing · Data warehouses · Data platforms · Database query processing

Martin Hentschel, [email protected]
Jonathan Dees, [email protected]
Florian Funke, [email protected]
Max Heimel, [email protected]
Ismail Oukid, [email protected]

1 IT University of Copenhagen, Copenhagen, Denmark
2 Snowflake, Berlin, Germany

1 Introduction

The prominence of cloud-based data management systems can be attributed to three main factors dating back to the early 2000s: the exponential growth of data, commonly referred to as "big data"; the simplistic interfaces of early big data tools; and the emergence of cloud computing.

With the rise of the internet in the early 2000s, the amount of data generated worldwide began to explode. Business processes were streamlined through connected applications, and sensors started monitoring every step of production pipelines across industries, from manufacturing to logistics and retail. Online-first or online-only businesses became the norm, disrupting traditional "offline" markets. Smart handheld devices constantly generate data, including photos, videos, and location data. Online communication produces vast amounts of data, from emails and social media posts to logs of browsing behavior. This led to an ongoing explosion in global data volume [1], a phenomenon that came to be known as "big data".

Initially, processing big data was a manual, laborious task. For example, engineers at Google had to implement numerous applications by hand to process large volumes of data. Abstracting away the complexity of parallelization, fault tolerance, and load balancing in these applications led to one of the first big data frameworks: MapReduce [2]. Soon after, open-source tools implementing the MapReduce paradigm became popular. The most prominent of these is Apache Hadoop [3]. While these tools enabled big data processing, their usability was limited by simplistic interfaces, which were far removed from the declarative nature of SQL-based relational databases. Rewriting existing business processes, often built on SQL or SQL-based tools, made large-scale adoption costly.

At the same time, to manage the growing volumes of data, companies like Amazon and Google began building increasingly larger data centers.
These companies soon realized they could rent out excess storage and compute capacity to external customers. This innovation, known as "cloud computing", was first popularized by Amazon with its S3 storage service and EC2 compute service. Compared to traditional on-premise systems, cloud computing offers two main advantages:

1. (De facto) Unlimited scalability and flexibility: Customers can rent storage and compute resources as needed, dynamically scaling them without the need to pre-plan for peak demands.
2. Managed resources: Cloud providers handle the maintenance of hardware and software, including automatic updates, security patches, and performance enhancements. This relieves customers of the complexities of managing their own IT infrastructure.

In this paper, drawing on our decade-long experience building and operating Snowflake [4], one of the largest cloud-based data management systems, we explore how modern systems leverage these advantages to efficiently process large volumes of data. In particular, we describe a reference architecture for a cloud-based data management system and explain how to achieve efficient query execution and data manipulation within this architecture. Key techniques include pruning large file sets at both compile time and query runtime, as well as optimizing data layouts in the background. Additionally, we delve into the practical aspects of developing and operating such a system at scale. Finally, we discuss industry trends that may influence future research. Topics such as security and advanced query optimization techniques are beyond the scope of this paper.

2 Background

2.1 Cloud Computing

Cloud computing offers almost unlimited scalability and flexibility. Customers can rent as much storage space and as many compute resources as they need, only for as long as they need them. These characteristics have changed how software applications can be scaled on demand, allowing "internet-sized" applications to be built and deployed rapidly. Cloud computing services include storage services, compute services, and higher-level services. In terms of storage and compute, this section focuses specifically on widely used cloud storage and compute services, such as file storage and cloud VMs. These services are commonly offered by providers with similar features and performance characteristics. The discussion of other cloud storage and compute services is beyond the scope of this paper.

Cloud storage services provide highly scalable storage for data in files. Files remain the fundamental unit for storing data in the cloud. The interface of storage services is simple, offering basic put and get operations to write and read files. Files can also be read partially. Listing files in a directory is possible but typically a slow operation, which requires alternative file cataloging solutions to manage many files efficiently. Examples of cloud storage services include AWS S3 [5], Azure Blob Storage [6], and Google Cloud Storage [7].
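To make this interface concrete, the following sketch (ours, not from the paper; it assumes AWS S3 accessed through the boto3 client, and the bucket and key names are placeholders) shows the put, get, partial-read, and list operations described above:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-warehouse-bucket"      # placeholder bucket name
KEY = "tables/users/data_0002.parquet"   # placeholder object key

# Put: write a file (object) in a single request; the object is the unit of storage.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"...parquet bytes...")

# Get: read the whole file back.
data = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

# Partial read: fetch only a byte range, e.g., to inspect the first 4 KiB of the file.
prefix = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-4095")["Body"].read()

# Listing is paginated and comparatively slow, which is why large systems keep
# their own file catalog in a metadata store instead of listing the bucket.
page = s3.list_objects_v2(Bucket=BUCKET, Prefix="tables/users/", MaxKeys=1000)
keys = [obj["Key"] for obj in page.get("Contents", [])]
```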
Cloud compute services allow customers to rent virtual machines (VMs) equipped with optimized hardware designed to meet diverse performance needs. Compute service offerings are typically grouped into VM families, such as "compute optimized", "network optimized", and "general purpose", and different sizes within each family, ranging from a single virtual CPU and a proportional fraction of other resources to a full physical machine. Optionally, VMs may be equipped with specialized hardware such as GPUs and a share of the machine's local SSDs or disks. The interface to cloud compute services consists of machine images, typically Linux distributions with customer-provided software and configuration scripts that are executed upon booting the machine image. It is up to the customer-provided software to coordinate distributed applications, for example, by using networking protocols or files and database services as shared state. Examples of cloud compute services include AWS EC2 [8], Azure Virtual Machines [9], and Google Compute Engine [10].

There are higher-level cloud services, such as serverless functions, queuing services, and small relational and key-value databases. While these services can be useful for building scalable applications, they tend to play a minor role in building a data management system for the cloud due to significant differences in interface, performance, and pricing across cloud providers. Other higher-level services, such as notification services, key management services, and virtual private clouds, are valuable and commonly utilized by cloud-based data management systems.

2.2 Observations on Cloud Computing

Lower Consistency Storage Cloud storage services may provide lower consistency guarantees than hard disks in on-premises data centers. For example, file writes in AWS S3 used to be eventually consistent, meaning that writes to a file in S3 did not become immediately visible to readers. With a product upgrade in 2020, AWS S3 now provides strong read-your-write consistency for updates to a file [11]. However, in our view, it is still advisable to assume lower consistency guarantees and design the service accordingly to ensure it works across all cloud providers, even if some offer improved consistency.
Lower consistency guarantees require a different design for managing files in a cloud-native data management system: immutable files. Specifically, files should be written once under a single name and never modified. This approach has several benefits. First, it avoids potential inconsistencies when readers do not see the latest updates to files. Second, it simplifies cache coherence: if a file is in the cache, it is always current and never stale. Third, snapshot isolation and time travel can be implemented easily by operating on files associated with a specific snapshot in time. Since files are never modified in place, the snapshot remains consistent. However, there are also drawbacks, such as longer processing times for updates and deletions due to the necessary copy-on-write mechanism. Rows not deleted must be copied to a new file under a different name. This issue can be partially mitigated by using deletion vectors [12, 13].
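A minimal sketch of this write-once convention (illustrative only; the storage wrapper, naming scheme, and helper names are our assumptions, not the system's implementation):

```python
import uuid

def write_immutable_file(storage, table: str, data: bytes) -> str:
    """Write data once under a fresh, unique name and never modify it afterwards."""
    key = f"tables/{table}/data_{uuid.uuid4().hex}.parquet"
    storage.put(key, data)   # assumed put(key, bytes) wrapper around the cloud store
    return key               # the new key is recorded in the table's metadata

def delete_rows_copy_on_write(storage, table: str, old_key: str,
                              surviving_rows: list[bytes]) -> str:
    """Copy-on-write delete: the surviving rows go into a brand-new file.

    The old file is never overwritten; it is only dropped from the table's file
    listing and physically removed later by a background service (Sect. 3.2.2).
    """
    return write_immutable_file(storage, table, b"".join(surviving_rows))
```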

Slow Compute Startup Times Cloud compute services have interesting characteristics as well. The provisioning time for virtual machines in the cloud is relatively long, typically in the order of minutes. This can lead to a poor customer experience if compute clusters are not available instantly. Therefore, it is essential to maintain a pool of already started virtual machines that can be quickly swapped in. The size of this free pool depends on customer demand, and determining how to adaptively size the pool to avoid both over-provisioning and under-provisioning is an interesting optimization problem [14]. Additionally, it is noteworthy that virtual machines rarely fail. Nonetheless, it is advisable to design systems with fault tolerance to handle potential hardware and software machine failures. An advantage of such a design is the support for seamlessly replacing virtual machines with newer ones that have updated software. This capability is a crucial component of the rolling upgrade mechanism required for running a data management service in the cloud with continuous availability.
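To illustrate the sizing problem, the following toy heuristic (ours, not the policy studied in [14]; the headroom factor and demand window are arbitrary assumptions) keeps a free pool replenished in the background toward a target derived from recent demand:

```python
from collections import deque

class FreePool:
    """Keep enough pre-started VMs on hand to absorb bursts of cluster requests."""

    def __init__(self, provision_vm, history_len: int = 60, headroom: float = 1.2):
        self.provision_vm = provision_vm   # slow call (minutes) to the cloud provider
        self.idle_vms = deque()            # already-booted VMs, ready to hand out
        self.recent_demand = deque(maxlen=history_len)  # VMs requested per interval
        self.headroom = headroom

    def target_size(self) -> int:
        if not self.recent_demand:
            return 1
        # Size for the recent peak demand plus some headroom.
        return max(1, int(max(self.recent_demand) * self.headroom))

    def replenish(self):
        """Runs periodically in the background so foreground requests never wait."""
        while len(self.idle_vms) < self.target_size():
            self.idle_vms.append(self.provision_vm())

    def acquire(self, n: int):
        """Hand out n pre-started VMs instantly; fall back to slow provisioning if empty."""
        self.recent_demand.append(n)
        return [self.idle_vms.popleft() if self.idle_vms else self.provision_vm()
                for _ in range(n)]
```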
Metadata Management In addition to cloud storage services for managing large volumes of data, there are established technologies for storing metadata, such as dictionary items (e.g., schemas, tables, and views), session data (e.g., session variables and parameters), and runtime state (e.g., running and queued jobs). The requirements for metadata management differ from those for data management. Metadata updates are frequent, involve small amounts of data, and typically include point reads and updates across concurrent users. Transactional key-value stores, such as FoundationDB [15], deployed on cloud-based virtual machines, are well-suited for this use case as they efficiently manage metadata at scale. Existing higher-level services, such as cloud-hosted relational databases or key-value stores, can serve as alternatives to self-managed metadata storage solutions. However, the varying performance characteristics of these higher-level services across cloud providers led us to avoid this approach.

3 Building a Data Management System for the Cloud

Given the rapid growth in data sizes, the need for higher-level programming languages, and the characteristics of cloud computing discussed in the previous section, the challenge is how to build a data management system optimized for the cloud. Our goal is to design an architecture that can scale to handle large data sizes, offer elasticity for customers to easily adjust resources, and efficiently process complex queries over large datasets.

We assume this system is built on a cloud platform that offers two essential services: a storage service for reliable file storage and retrieval, and a compute service that enables the provisioning of compute nodes with predefined hardware configurations. Additionally, we assume access to a scalable, consistent data store, such as a distributed key-value store, which supports the transactional storage and updating of metadata.

Fig. 1 Reference architecture of a data management system in the cloud

Figure 1 illustrates a reference architecture for a data management system in the cloud. The architecture consists of three layers: a control plane layer, a compute plane layer, and a storage layer. The control plane is responsible for the overall service operation, resource management, and user interaction. It contains a network load balancer (NLB), components for authenticating and authorizing requests (Auth), management logic for provisioning cloud resources (Infra), and a metadata store that keeps dictionary metadata and runtime state. The compute plane is responsible for executing user requests such as SQL queries or Python programs. It consists of clusters of compute nodes that can be brought online rapidly. Compute clusters may vary in size and may be dedicated to or shared between individual customers. Finally, the storage layer is responsible for reliably storing files. Typically, services like AWS S3 or Azure Storage suffice as storage layers.
The overall concept of this architecture is the separation of storage and compute. The benefit is that each layer scales independently. Compute clusters can be added, removed, or resized without the need to shuffle data. Storage scales (almost) infinitely without the need to overprovision compute instances. This allows for high flexibility based on customer demand, such as total data volume, number of customers, and workload complexity. Additionally, this architecture makes it easy to upgrade the software of the separate layers independently. For example, a compute cluster with a newer software version can be started, and customer requests can be routed to this new cluster. This enables seamless version upgrades, which is essential when running a data management system as a service in the cloud. Finally, customers can be charged separately for compute and storage, following the usage-based pricing model of cloud computing services.

In the remainder of this section, we discuss three key components of an efficient cloud-based data management system. First, Sect. 3.1 covers efficient query execution, focusing on pruning, data layout optimization, and caching. Next, Sect. 3.2 outlines the essential elements of efficient DML execution, including transaction handling and background file maintenance. Finally, Sect. 3.3 presents our insights on efficiently operating a large system in the cloud, addressing configuration management, testing, monitoring, and the role of data-driven development.

3.1 Efficient Query Execution

In architectures where storage is separate from compute, efficiently handling large file volumes is crucial for optimizing query execution. In the following, we discuss three important techniques for this: pruning large sets of files using statistics, optimizing data layouts in the background, and reusing work. These methods complement traditional query optimizations, such as join ordering, that remain directly applicable in cloud-native systems.

3.1.1 Pruning Large Sets of Files

Since data does not reside on compute resources, these resources must retrieve files from the storage layer. The fewer files that need to be retrieved, the faster a query or program executes. The technique used to reduce the number of files is called pruning. Pruning has a significant impact on performance in cloud data management systems, as approximately 50% of query execution time is typically spent on scanning and filtering [16].

Fig. 2 Example table consisting of user data. The table has two columns, ID and Name, and six rows. The data is sorted by ID (a) and by Name (b). The table is partitioned into three files, respectively

Pruning sets of files is typically done using statistics, usually in the form of min-max indexes per column per file. Min-max indexes are also known as zone maps, block range indexes, or small materialized aggregates [17]. Given min-max statistics for a particular column in each file, if a query includes a predicate on that column, the query compiler can exclude any files whose min-max ranges fall outside the predicate.

As an example of file pruning, consider the table in Fig. 2a. The table represents user data and consists of two columns, ID and Name, and six rows. The table is sorted in its natural insertion order with increasing IDs. Adding one more user to the table would result in an additional row with ID 7, for instance. In this example, the data in Fig. 2a is split into three files, File 1 to File 3. These files are stored in the storage layer of the data management system in the cloud. Files may contain different numbers of rows depending on the size of the rows and the partitioning logic of the system when executing data modification queries.

Statistics, such as min-max indexes per column per file, are stored separately from the data in the system's metadata store or in the storage layer. Storing pruning metadata separately from data files (or in addition to in data files) is essential to make compile-time pruning cheap, as the metadata that needs to be loaded at compile time is orders of magnitude smaller than the actual data. Managing this metadata is a key challenge in cloud-based systems, as, at large scale, metadata management itself becomes a data management problem [18].

Consider the following SQL query: SELECT * FROM users WHERE ID = 5. The query compiler loads the table's metadata from the metadata store. This metadata includes the list of files associated with the table and the min-max indexes of these files. Based on these indexes, the query compiler can prune File 1 and File 3 from the list of files, as the query predicate (ID = 5) allows ruling out any file with max(ID) < 5 and any file with min(ID) > 5. As a result, the compute cluster only needs to retrieve File 2 from the storage layer during query execution. In practice, query compilers prune lists of millions of files down to only a few dozen or a few hundred files. In these cases, pruning is a particularly effective tool for query optimization.
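The compile-time pruning step can be sketched as follows (illustrative Python, not the system's implementation; the per-file min-max values are assumed, since the exact partitioning in Fig. 2a is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    name: str
    min_val: int   # min-max index entry for the predicate column
    max_val: int

def prune_files(files: list[FileStats], value: int) -> list[FileStats]:
    """Keep only files whose [min, max] range can contain the equality predicate value."""
    return [f for f in files if f.min_val <= value <= f.max_val]

# Assumed min-max indexes for the ID column of the three files in Fig. 2a.
files = [FileStats("File 1", 1, 2), FileStats("File 2", 3, 5), FileStats("File 3", 6, 6)]

# SELECT * FROM users WHERE ID = 5: only files whose range covers 5 survive.
print([f.name for f in prune_files(files, 5)])   # ['File 2']
```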
Pruning can occur both at query compile time, as shown in the example above, and at query runtime. Runtime pruning is possible when intermediate results limit the number of files that need to be accessed in the next step of query execution. There are several types of runtime pruning, including filter pushdown pruning, TopK pruning, and join pruning, which we illustrate below.

Fig. 3 Runtime pruning by pushing down filters and TopK boundaries into the scan operator (a) and sharing a summary of results between scan operators (b)

Filter Pushdown Pruning A standard query execution technique is to push down query predicates into a table scan operator, as illustrated in Fig. 3a. We can leverage these pushed-down filters to prune more data at query runtime. At compile time, pruning removes files from the list of files to scan, as described above. However, files can contain richer metadata than what is available at compile time. Files can also be internally sliced in a way that allows pruning of sub-file units, as, for example, in the Parquet file format. Files in Parquet format may be internally sliced into "row groups", the columns of which can be further sliced into "pages". A Parquet file footer may contain min-max metadata for both row groups and pages. This metadata may not be available at compile time. Therefore, at runtime, we can prune sub-file units, significantly improving execution efficiency.

TopK Pruning A TopK operator usually keeps a heap of the top K elements. Let us call the last element in that heap the "current boundary". The operator accepts a new element only if it is greater than this boundary when the order is descending, or lower than this boundary when the order is ascending. Hence, as also illustrated in Fig. 3a, the TopK operator can continuously push down the current boundary to the table scan operator, which leverages it to filter more files and rows. We observed that TopK pruning can improve query runtime by up to two orders of magnitude [19].
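A toy sketch of the boundary pushdown for the descending-order case (our own illustration; the scan interface, heap details, and file layout are assumptions, not the engine's implementation):

```python
import heapq
from dataclasses import dataclass

@dataclass
class ScanFile:
    max_val: int        # min-max index entry for the ORDER BY column
    rows: list[int]     # column values in this file (simplified)

class TopKDescending:
    """Keep the K largest values and expose the current boundary to the scan."""

    def __init__(self, k: int):
        self.k = k
        self.heap: list[int] = []   # min-heap; heap[0] is the current boundary

    def boundary(self):
        return self.heap[0] if len(self.heap) == self.k else None

    def offer(self, value: int):
        b = self.boundary()
        if b is not None and value <= b:
            return                      # cannot enter the top K; reject immediately
        heapq.heappush(self.heap, value)
        if len(self.heap) > self.k:
            heapq.heappop(self.heap)

def scan_with_topk_pruning(files: list[ScanFile], topk: TopKDescending):
    """The scan skips whole files whose max value cannot beat the current boundary."""
    for f in files:
        b = topk.boundary()
        if b is not None and f.max_val <= b:
            continue                    # file-level runtime pruning
        for value in f.rows:
            topk.offer(value)

files = [ScanFile(10, [1, 5, 10]), ScanFile(4, [2, 3, 4]), ScanFile(90, [70, 80, 90])]
topk = TopKDescending(k=2)
scan_with_topk_pruning(files, topk)     # the second file is skipped entirely
print(sorted(topk.heap, reverse=True))  # [90, 80]
```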
Join Pruning When joining two tables, the execution engine can send a summary of the distinct values from the build side to the probe side and use that summary to prune the files and rows of the probe table. This is illustrated in Fig. 3b. This technique is generalizable to N-table joins. For example, if a join of two tables results in a small intermediate result, joining this result with a third table can potentially allow pruning more files from the third table's list based on the summary from the intermediate result.

The above technique can be compared to semi-join reduction [20], which, at the most abstract level, involves taking condensed information from one side of the join and using it to reduce the amount of data that must be exchanged from the other side to compute the result. This can be implemented in several, often complementary, ways: through a proper semi-join (sending the entire join key column), via a Bloom filter [21, 22], or through min-max-based dynamic pruning. Snowflake employs a combination of Bloom filters for tuple-based filtering and join pruning for micro-partition-based pruning.
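The following sketch (ours, simplified; a plain Python set stands in for a real Bloom filter, and the file and summary structures are assumptions) shows how a build-side summary can prune probe-side files and rows at runtime:

```python
from dataclasses import dataclass

@dataclass
class ProbeFile:
    name: str
    min_key: int              # min-max index on the join key column
    max_key: int
    rows: list[tuple]         # (join_key, payload) pairs, simplified

def build_summary(build_keys: list[int]):
    """Summarize the build side: its key range plus a set standing in for a Bloom filter."""
    return (min(build_keys), max(build_keys), set(build_keys))

def probe_with_join_pruning(files: list[ProbeFile], summary) -> list[tuple]:
    lo, hi, keys = summary
    out = []
    for f in files:
        # Micro-partition-level pruning: skip files whose key range misses the build side.
        if f.max_key < lo or f.min_key > hi:
            continue
        # Tuple-level filtering: drop rows whose key is not in the (Bloom-filter-like) set.
        out.extend(row for row in f.rows if row[0] in keys)
    return out

summary = build_summary([4, 5])                         # small build side
files = [ProbeFile("P1", 1, 3, [(1, "a"), (2, "b")]),   # pruned entirely
         ProbeFile("P2", 4, 6, [(4, "c"), (6, "d")])]   # only (4, "c") survives
print(probe_with_join_pruning(files, summary))
```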
3.1.2 Optimizing Data Layouts

The effectiveness of pruning depends heavily on customer workloads and corresponding data layouts. The query compiler can effectively utilize pruning only when the data layout aligns with the typical customer workload. Therefore, an optimization technique is to organize data layouts based on user workloads. This technique is particularly effective for data management systems in the cloud because the necessary compute clusters can be brought online in addition to customer-utilized compute clusters, avoiding interference with customer workloads. Changes to data layouts can be committed optimistically (perhaps with low priority), further avoiding interference with customer workloads. Shifting maintenance work to background compute clusters is a powerful advantage of data management systems in the cloud.

In the above example, querying the data in Fig. 2a by ID was effective because the data was sorted by ID. The query compiler could optimally prune the list of files. However, querying the same data in Fig. 2a by Name would not result in optimal pruning. If the query compiler tried to prune the list of files for the query SELECT * FROM users WHERE Name = 'Kasper', it would not be able to prune any files. The name "Kasper" falls within the min-max index ranges of all three files. Hence, the list of files that the compute cluster must retrieve from the storage layer consists of File 1, File 2, and File 3.

If the customer workload typically involves filtering data by Name, it makes sense to organize the table's data layout differently, sorted by Name. Such a data layout is shown in Fig. 2b. If the query compiler tries to prune the list of files (i.e., File 4, File 5, File 6) for the query SELECT * FROM users WHERE Name = 'Kasper', it would be able to prune Files 4 and 6 because max(Name) of File 4 is smaller than 'Kasper' and min(Name) of File 6 is larger than 'Kasper'. This demonstrates that optimizing the data layout based on customer workload is an effective strategy.

In practice, there are two main challenges:

1. Automatically determining a typical customer workload is difficult. This makes it challenging to decide which tables to optimize and by which columns or expressions.
2. Sorting large tables is computationally expensive. This makes it challenging to find a cost-effective way to achieve a "good enough" data layout and to define a metric for what "good enough" means.

While systems are increasingly taking steps to automatically understand and characterize customer workloads, it is still common for customers to define the tables and columns for which they want to optimize data layouts. For example, the Snowflake data management system allows customers to define a "clustering key". A clustering key in Snowflake is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table within the same files [23].

Sorting large datasets is time-consuming, making it practically infeasible for large production tables. To still achieve an efficient data layout, a technique called clustering is used. Clustering maintains an approximate sort order based on the columns (or expressions) defined by the clustering key. Unlike full sorting, clustering does not require a perfect order but strives to optimize overall query performance. The metric used to model query performance is called clustering depth. Clustering depth is the average depth (1 or greater) of the overlapping micro-partitions with respect to the clustering key [24]. Micro-partitions are analogous to files for this discussion. The fewer overlapping files based on min-max indexes on the customer-specified clustering key, the better the query performance when filtering data by the respective key.

Fig. 4 Plots of min-max ranges for the Name column for Files 1 to 3 (a) and for Files 4 to 6 (b)

To illustrate the clustering depth metric, Fig. 4 plots min-max ranges for the Name column for Files 1 to 3 (Fig. 4a) and Files 4 to 6 (Fig. 4b). Given the predicate Name = 'Kasper', the number of overlapping files at this point on the Name axis equals three in Fig. 4a and one in Fig. 4b. The average depth of the overlapping files in Fig. 4b is smaller than the average depth in Fig. 4a. Therefore, query performance when filtering data by Name is expected to be better for the clustered data layout in Fig. 4b.

The process to maintain a low clustering depth of a table can be automated and shifted to the background to avoid impacting customer foreground workloads. The background process, running on a separate compute cluster from the customer compute clusters, continuously monitors the clustering depth of a table that has a clustering key defined. For regions of the table that exceed a specific, system-defined threshold, the background process sorts the respective data files along the clustering axis. The process then replaces the old files with the new files for each region. This replacement can be executed as a low-priority, optimistic commit that succeeds if it does not impact concurrently running customer transactions. Table regions can be processed independently in the background, completely transparent to the customer. Figure 5 illustrates this process.

Fig. 5 Clustering two separate regions, R1 and R2, of a table by sorting the respective data files along the clustering axis. The new regions have a lower clustering depth than the old regions

In summary, the goal of clustering is to achieve predictable query performance by limiting the number of files that compute clusters need to retrieve from the storage layer. The chosen metric for optimizing the data layout of a table is clustering depth, which measures the number of overlapping min-max ranges along the customer-defined clustering key. In our experience, this method works well in practice; however, it remains an open research question whether clustering depth is the optimal metric for modeling query performance and pruning efficiency.
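A rough sketch of the depth computation (our own reading of the metric described in [24]; the sampled probe points and the hypothetical min-max ranges are simplifications and not taken from Fig. 4):

```python
from dataclasses import dataclass

@dataclass
class Range:
    lo: str   # min value of the clustering key in one file (micro-partition)
    hi: str   # max value

def overlap_depth(ranges: list[Range], point: str) -> int:
    """Number of files whose min-max range on the clustering key covers the point."""
    return sum(1 for r in ranges if r.lo <= point <= r.hi)

def clustering_depth(ranges: list[Range], probe_points: list[str]) -> float:
    """Average depth (>= 1 wherever data exists) across sampled points on the key axis."""
    depths = [d for p in probe_points if (d := overlap_depth(ranges, p)) > 0]
    return sum(depths) / len(depths)

# Hypothetical min-max ranges in the spirit of Fig. 4: unclustered vs. sorted by Name.
unclustered = [Range("Anna", "Zoe"), Range("Ben", "Tom"), Range("Carl", "Yara")]
clustered   = [Range("Anna", "Carl"), Range("Dora", "Kim"), Range("Lena", "Zoe")]
probes = ["Ben", "Kasper", "Tom"]

print(clustering_depth(unclustered, probes))  # high: most files overlap each point
print(clustering_depth(clustered, probes))    # 1.0: at most one file per point
```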
3.1.3 Materialization and Caching

Efficient query execution involves reusing previously completed work. This includes reusing previously computed results, utilizing materialized indexes, and leveraging caches wherever possible. Materialization and caching are especially effective in a multi-tenant, multi-user data management system in the cloud, as computed results, indexes, and cached data can be shared across users and, in cases of data sharing, sometimes across customers as well. Naturally, reusing cached or materialized data must comply with defined access control policies and undergo the system's standard authorization checks. For instance, a cached result originally computed by user A should be visible to user B only if user B has the same or higher privileges on that data as user A. However, these security aspects are beyond the scope of this paper.

One of the most effective ways to reuse work is by leveraging previously computed results. When the data management system receives a query that was executed before and meets certain conditions (e.g., read-only data access, no changes to the underlying data since the previous execution, and proper access rights), the system can return the previously computed result without needing to compile and execute the query again. This requires persisting results in files within the storage layer so they can be efficiently retrieved for later reuse. Typically, results are stored in files within the storage layer to facilitate fault-tolerant streaming of query results to a client. Result reuse, therefore, takes advantage of already persisted query outcomes. The key challenges of result reuse are twofold: constructing succinct query fingerprints (e.g., hash values) used to compare old and new queries in a robust (with regard to collisions) and secure way; and a fast, online discoverability mechanism for matching cached results (e.g., a metadata index by query fingerprint).
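One way to build such a fingerprint (a sketch under our own assumptions; the exact inputs Snowflake hashes are not specified in the paper) is to hash the normalized query text together with everything that could change the result, such as the snapshot's table versions and the caller's access context:

```python
import hashlib
import json

def query_fingerprint(sql: str, table_versions: dict[str, int], role: str,
                      relevant_settings: dict[str, str]) -> str:
    """Succinct, collision-resistant key for result reuse.

    Two executions may share a cached result only if the normalized text, the
    snapshot (table versions), the access context, and any result-affecting
    settings are all identical.
    """
    normalized_sql = " ".join(sql.split())   # toy normalization of whitespace
    payload = json.dumps(
        {
            "sql": normalized_sql,
            "tables": dict(sorted(table_versions.items())),
            "role": role,
            "settings": dict(sorted(relevant_settings.items())),
        },
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()

result_index: dict[str, str] = {}   # fingerprint -> path of the persisted result file

fp = query_fingerprint("SELECT * FROM users WHERE ID = 5",
                       {"users": 42}, role="analyst",
                       relevant_settings={"timezone": "UTC"})
if fp in result_index:
    print("reuse persisted result:", result_index[fp])
else:
    result_index[fp] = "results/" + fp + ".parquet"   # placeholder path after executing
```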
Caches help reuse operations across all parts of a data management system. Examples include file caches on compute clusters, min-max index caches in the query compiler, and caches for dictionary metadata. File caching on compute clusters involves storing files downloaded from the storage layer on the SSDs of the machines on the compute cluster. Since files are immutable, cache coherence is straightforward: if the file needed by the machine exists on the SSD, the machine can read the file from the SSD, as it is always up to date. The cache in the query compiler stores the list of min-max indexes in main memory for frequently accessed tables. This may require large amounts of memory, so only the min-max indexes of the most frequently used tables are cached. Warming the min-max index cache can help avoid customer-noticeable slowdowns in query compilation when new machines are swapped into the system.

3.2 Efficient DML Execution

The efficient implementation of data modification statements (DMLs), such as SQL's INSERT, MERGE, UPDATE, and DELETE, generally requires thorough integration into the database system's design. Cloud-based data management systems are no exception, and their DML support is closely coupled with other design decisions, such as the use of immutable files.

Systems built on immutable files handle deletions via a copy-on-write mechanism. When a row is deleted from a file, the remaining rows are copied into a new file. Updates are treated as a combination of deletes and inserts. While this approach works well for bulk operations, it can create unnecessary overhead for scattered deletions. Specifically, if only a few rows are deleted from a file, the majority of the file still needs to be rewritten, leading to additional compute time and increased storage overhead. To mitigate this and improve the performance of DML operations, some systems use "merge-on-read", a technique that employs an optional auxiliary file for each data file. The auxiliary file stores the row IDs of deleted rows in a format referred to as deletion vectors [12, 13]. Subsequent updates to the same data file then apply the copy-on-write process using the smaller auxiliary file, which is updated accordingly.
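The contrast between the two strategies can be sketched as follows (our simplification; deletion vectors are typically stored as compact bitmaps in separate auxiliary files rather than as in-memory Python sets):

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    name: str
    rows: list[tuple]                                   # immutable once written
    deleted: set[int] = field(default_factory=set)      # deletion vector: row IDs to skip

def delete_copy_on_write(f: DataFile, row_ids: set[int], new_name: str) -> DataFile:
    """Rewrite the whole file without the deleted rows (fine for bulk deletes)."""
    kept = [r for i, r in enumerate(f.rows) if i not in row_ids]
    return DataFile(new_name, kept)

def delete_merge_on_read(f: DataFile, row_ids: set[int]) -> DataFile:
    """Only the small auxiliary deletion vector changes; the data file stays as is
    (good for scattered deletes). Readers apply the vector while scanning."""
    return DataFile(f.name, f.rows, f.deleted | row_ids)

def scan(f: DataFile):
    return [r for i, r in enumerate(f.rows) if i not in f.deleted]

f = DataFile("File 2", [(3, "Ana"), (4, "Ben"), (5, "Kasper")])
print(scan(delete_merge_on_read(f, {1})))            # [(3, 'Ana'), (5, 'Kasper')]
print(scan(delete_copy_on_write(f, {1}, "File 7")))  # same rows, but a rewritten file
```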
DML statements produce new versions of a table. Table versions consist of listings of the data files and possibly auxiliary files. Multiple versions can then be used for concurrency control using snapshot semantics. When the system receives a query, it assigns it to a specific version of the table, known as the snapshot. Query compilation and execution operate exclusively on this snapshot, without being affected by newer versions of the table. This approach works well for concurrent read-only queries, as they do not conflict with each other. However, in the case of concurrent DML statements, which produce new files and new table versions, a transaction manager must detect and resolve conflicts.

3.2.1 Transaction Handling

Detecting conflicts in concurrently running DML statements or multi-statement transactions involving multiple DML statements is challenging. The simplest scenario is concurrent append-only operations, such as INSERTs that add files to the table without reading any existing files. Append-only statements can run in parallel and be committed without conflicts if primary key constraints are not enforced. However, if concurrent DML statements or multi-statement transactions (both hereafter referred to simply as transactions) involve reading files from the same table in addition to writing, a transaction manager must detect conflicts and allow only non-conflicting transactions to be committed. Aborted transactions can be transparently recompiled and re-executed for the user until a successful commit occurs or a system-defined retry limit is reached. If the retry limit is exceeded, the abort is returned to the user as an error.

The transaction manager can run either as a centralized component within the control plane layer of the data management system or as distributed logic across compute clusters, relying on a centralized transactional runtime state in the metadata store. Due to the use of immutable files, the transaction manager detects conflicts at the file level rather than the row level. This reduces the amount of state the transaction manager needs to maintain per transaction (at the cost of detecting some false-positive conflicts).
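A file-level conflict check can be sketched as follows (an optimistic-commit illustration under our own assumptions; it is not the paper's actual protocol):

```python
from dataclasses import dataclass, field

@dataclass
class Transaction:
    txn_id: int
    base_version: int                                     # table version compiled against
    read_files: set[str] = field(default_factory=set)
    removed_files: set[str] = field(default_factory=set)  # files replaced or deleted
    added_files: set[str] = field(default_factory=set)

def conflicts(candidate: Transaction, committed_since_base: list[Transaction]) -> bool:
    """File-level detection: the candidate conflicts if any transaction committed after
    its base version removed a file the candidate read or also removed. This is coarser
    than row-level checks, so some false-positive conflicts are possible."""
    touched = candidate.read_files | candidate.removed_files
    return any(touched & other.removed_files for other in committed_since_base)

# Append-only INSERTs (empty read/removed sets) never conflict and commit in parallel.
t1 = Transaction(1, base_version=10, added_files={"f_new_1"})
t2 = Transaction(2, base_version=10, read_files={"f_3"}, removed_files={"f_3"},
                 added_files={"f_3b"})
print(conflicts(t1, [t2]))   # False: t1 only appended files
t3 = Transaction(3, base_version=10, read_files={"f_3"}, removed_files={"f_3"})
print(conflicts(t3, [t2]))   # True: a concurrent DML already replaced f_3
```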
In general, any transaction conflict avoidance or detection technique, such as pessimistic locking or optimistic locking, can be used. Conflict resolution typically involves recompiling and re-executing the conflicting queries. Ultimately, the transaction manager creates a new table version when committing a transaction, which is persisted in the metadata store. Because writes to the metadata store are themselves transactional, it is possible to persist new versions for multiple tables simultaneously, enabling support for multi-table transactions. Conflict detection is the same for both single-table and multi-table transactions. In this capacity, the transaction manager acts as the consensus component of the data management system.

3.2.2 Background File Maintenance

The compute layer of a cloud-based data management system allows for reserving compute resources to perform maintenance tasks in the background. Shifting maintenance tasks to the background eliminates user-visible performance impacts while optimizing the system. One example is the clustering service mentioned in the previous section. There are other important background services for file maintenance:

File Consolidation Background Service The need for immutable files can result in the creation of many small files, especially in cases like a single-row insert statement (e.g., INSERT INTO users VALUES (6, 'Kasper')). This insert produces a new data file containing only one row, as shown in Fig. 2a. Having many small files can degrade performance: compute clusters need to retrieve many files from storage, slowing query execution, and query compilers must handle large file metadata, slowing query compilation. To address this, small files must be consolidated into larger files. One option is to perform this consolidation in the foreground, where the DML reads existing files, writes new files with the additional data, and produces a new table version that replaces the old files. However, this slows down DML execution. A better option is to consolidate files in the background. This background service merges small files into larger files that are easier to manage and more efficient to read. The system then swaps the old files for the new ones through a DML operation committed to the table. To minimize disruption, this DML can be given lower priority than customer DMLs, allowing the transaction manager to abort the file consolidation DML if conflicts arise. This approach reduces user-visible performance impacts on foreground DML operations.

File Deletion Background Service When files are deleted from a table, for example during the execution of DELETE or UPDATE statements, the deletion of the physical files from cloud storage can be handled in the background. This allows files to be retained for the duration of the data retention period, enabling features such as time travel. It also reduces the amount of work during the foreground execution of a query. Deleting files is surprisingly challenging in a cloud environment. In cloud-scale data management systems, where the number of files to be deleted is high, deleting files individually can lead to rate-throttling errors from the cloud storage provider due to the high request rate. Therefore, files must be deleted in batches. For example, the AWS S3 API allows the deletion of up to 1,000 files per request [25]. Additionally, parallelizing file deletion requests across multiple compute nodes can help avoid rate-throttling errors. Finally, deletion requests may fail and must be retried. Batching, parallelization, and request retries are best managed by a separate background service that operates independently of foreground query execution.
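A sketch of such a background deleter (assuming AWS S3 via boto3, whose DeleteObjects call accepts up to 1,000 keys per request [25]; the batch size, retry policy, and names are our choices):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BATCH_SIZE = 1000   # DeleteObjects limit per request [25]

def delete_files_in_batches(bucket: str, keys: list[str], max_retries: int = 5):
    """Delete expired files in batches, retrying keys the service reports as failed."""
    for start in range(0, len(keys), BATCH_SIZE):
        batch = [{"Key": k} for k in keys[start:start + BATCH_SIZE]]
        for _attempt in range(max_retries):
            try:
                response = s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
            except ClientError:
                continue                    # e.g., throttling; retry the whole batch
            failed = response.get("Errors", [])
            if not failed:
                break                       # batch fully deleted
            batch = [{"Key": e["Key"]} for e in failed]   # retry only the failed keys
```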
3.3 Efficient Development and Operation of the System

In the previous sections, we covered the fundamentals of building a cloud data management system and techniques for efficient data processing. This chapter shifts to the operational side, focusing on key considerations for running and developing a cloud-based data management service. Unlike traditional on-premise databases, cloud systems are offered as a service. Here, the developer also serves as the provider, being responsible for software maintenance, resource management, and critical functions like security, billing, and availability. This model frees customers from managing their own databases, allowing them to focus on their core business. We explore the following three core aspects: Configuration Management, Testing & Monitoring, and Data-Driven Development.
References

17. Moerkotte G (1998) Small materialized aggregates: a light weight index structure for data warehousing. VLDB
18. Edara P, Pasumansky M (2021) Big metadata: when metadata is big data. VLDB
19. Chilukuri Y, Waack J, Zimmerer A. Say hello to superfast top-K queries in Snowflake. https://www.snowflake.com/en/blog/super-fast-top-k-queries. Accessed 2024-10-02
20. Bernstein PA, Chiu D-MW (1981) Using semi-joins to solve relational queries. J ACM 28(1)
21. Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for local queries. ACM SIGMOD
22. Mullin JK (1990) Optimal semijoins for distributed database systems. IEEE Trans Softw Eng
23. Snowflake documentation: clustering keys & clustered tables. https://docs.snowflake.com/en/user-guide/tables-clustering-keys
24. Snowflake documentation: micro-partitions & data clustering. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions
25. Amazon S3 API Reference: DeleteObjects. https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html. Accessed 2024-10-02
26. Yan J, Jin Q, Jain S, Viglas SD, Lee A (2018) Snowtrail: testing with production queries on a cloud database. Workshop on Testing Database Systems
27. Dreseler M (2023) How building an industry DBMS differs from building a research one. Sponsor talk, CIDR. https://www.cidrdb.org/cidr2023/slides/sponsor-talk-snowflake-slides.pdf
28. Aboulnaga A, Salem K, Soror AA, Minhas UF, Kokosielis P, Kamath S (2009) Deploying database appliances in the cloud. IEEE Data Eng Bull 32(1)
29. Abadi DJ (2009) Data management in the cloud: limitations and opportunities. IEEE Data Eng Bull 32(1)
30. Narasayya VR, Chaudhuri S (2021) Cloud data services: workloads, architectures and multi-tenancy. Found Trends Databases 10(1)
31. Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T (2010) Dremel: interactive analysis of web-scale datasets. VLDB
32. Katahanas L, Talapady CB, Rowe J, Zhang F, Draves R, Friedman M, Filho SMI, Kumar A (2021) The Cosmos big data platform at Microsoft: over a decade of progress and a decade to look forward. VLDB
33. Brantner M, Florescu D, Graf D, Kossmann D, Kraska T (2008) Building a database on S3. ACM SIGMOD
34. Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon Redshift and the case for simpler data warehouses. ACM SIGMOD
35. Armenatzoglou N, Basu S, Bhanoori N, Cai M, Chainani N, Chinta K, Govindaraju V, Green TJ, Gupta M, Hillig S et al (2022) Amazon Redshift re-invented. ACM SIGMOD
36. Aguilar-Saborit J, Ramakrishnan R, Srinivasan K, Bocksrocker K, Alagiannis I, Sankara M, Shafiei M, Blakeley J, Dasarathy G, Dash S et al (2020) POLARIS: the distributed SQL engine in Azure Synapse. VLDB
37. Google BigQuery. https://cloud.google.com/bigquery
38. Armbrust M, Ghodsi A, Xin R, Zaharia M (2021) Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. CIDR
39. Databricks SQL. https://www.databricks.com/product/databricks-sql
40. Oracle Autonomous Data Warehouse. https://www.oracle.com/autonomous-database/autonomous-data-warehouse
41. IBM Db2. https://www.ibm.com/db2
42. Teradata. https://www.teradata.com
43. Müller I, Marroquín R, Alonso G (2020) Lambada: interactive data analytics on cold data using serverless cloud infrastructure. ACM SIGMOD
44. Apache Parquet. https://parquet.apache.org
45. Apache Iceberg. https://iceberg.apache.org

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
