Cloud Data Management Condensed 9 Pages
https://doi.org/10.1007/s13222-025-00494-9
SCHWERPUNKTBEITRAG (Focus Article)
Received: 14 October 2024 / Accepted: 8 January 2025 / Published online: 5 March 2025
© The Author(s) 2025
Abstract
The paper discusses the lessons learned from building Snowflake, a data management system for the cloud. Given the
need for systems that can scale to handle large data volumes, provide expressive programming interfaces, and leverage
the benefits of cloud computing, it describes the architecture of a cloud-based data management system and optimization
techniques specific to the cloud. Key techniques include pruning large file sets at both compile time and query runtime,
optimizing data layouts in the background, and, more generally, the importance of performing maintenance tasks in the
background, which is enabled by cloud resources. The paper also explains the need for using immutable files and the
implications for data modification queries. Finally, it highlights the operational aspects of building and maintaining a data
management system that functions as an online cloud service. The paper concludes by outlining future directions for
cloud-based data management systems.
Keywords Data management · Cloud computing · Data warehouses · Data platforms · Database query processing
…realized they could rent out excess storage and compute capacity to external customers. This innovation, known as “cloud computing”, was first popularized by Amazon with its S3 storage service and EC2 compute service. Compared to traditional on-premise systems, cloud computing offers two main advantages:

1. (De facto) Unlimited scalability and flexibility: Customers can rent storage and compute resources as needed, dynamically scaling them without the need to pre-plan for peak demands.
2. Managed resources: Cloud providers handle the maintenance of hardware and software, including automatic updates, security patches, and performance enhancements. This relieves customers of the complexities of managing their own IT infrastructure.

In this paper, drawing on our decade-long experience building and operating Snowflake [4], one of the largest cloud-based data management systems, we explore how modern systems leverage these advantages to efficiently process large volumes of data. In particular, we describe a reference architecture for a cloud-based data management system and explain how to achieve efficient query execution and data manipulation within this architecture. Key techniques include pruning large file sets at both compile time and query runtime, as well as optimizing data layouts in the background. Additionally, we delve into the practical aspects of developing and operating such a system at scale. Finally, we discuss industry trends that may influence future research. Topics such as security and advanced query optimization techniques are beyond the scope of this paper.

2 Background

2.1 Cloud Computing

Cloud computing offers almost unlimited scalability and flexibility. Customers can rent as much storage space and as many compute resources as they need, only for as long as they need them. These characteristics have changed how software applications can be scaled on-demand, allowing “internet-sized” applications to be built and deployed rapidly. Cloud computing services include storage services, compute services, and higher-level services. In terms of storage and compute, this section focuses specifically on widely used cloud storage and compute services, such as file storage and cloud VMs. These services are commonly offered by providers with similar features and performance characteristics. The discussion of other cloud storage and compute services is beyond the scope of this paper.

Cloud storage services provide highly scalable storage for data in files. Files remain the fundamental unit for storing data in the cloud. The interface of storage services is simple, offering basic put and get operations to write and read files. Files can also be read partially. Listing files in a directory is possible but typically a slow operation, which requires alternative file cataloging solutions to manage many files efficiently. Examples of cloud storage services include AWS S3 [5], Azure Blob Storage [6], and Google Cloud Storage [7].
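To make this interface concrete, the following sketch exercises put, get, partial get, and list against AWS S3 using the boto3 client. The bucket and key names are placeholders, and the other providers offer analogous operations.

# Minimal sketch of the cloud storage interface: put, get, partial get, list.
# Bucket and key names are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

# Write a file (put).
s3.put_object(Bucket="example-bucket", Key="tables/users/file1.parquet", Body=b"...")

# Read the whole file (get).
obj = s3.get_object(Bucket="example-bucket", Key="tables/users/file1.parquet")
data = obj["Body"].read()

# Read only part of the file, e.g., a footer, via an HTTP range request.
tail = s3.get_object(Bucket="example-bucket",
                     Key="tables/users/file1.parquet",
                     Range="bytes=-8192")["Body"].read()

# Listing is paginated and comparatively slow, which is why large systems
# keep their own file catalog instead of relying on LIST.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-bucket", Prefix="tables/users/"):
    for entry in page.get("Contents", []):
        print(entry["Key"], entry["Size"])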
Cloud compute services allow customers to rent virtual machines (VMs) equipped with optimized hardware designed to meet diverse performance needs. Compute service offerings are typically grouped into VM families, such as “compute optimized”, “network optimized”, and “general purpose”, and different sizes within each family, ranging from a single virtual CPU and a proportional fraction of other resources to a full physical machine. Optionally, VMs may be equipped with specialized hardware such as GPUs and a share of the machine’s local SSDs or disks. The interface to cloud compute services consists of machine images, typically Linux distributions with customer-provided software and configuration scripts that are executed upon booting the machine image. It is up to the customer-provided software to coordinate distributed applications, for example, by using networking protocols or files and database services as shared state. Examples of cloud compute services include AWS EC2 [8], Azure Virtual Machines [9], and Google Compute Engine [10].

There are higher-level cloud services, such as serverless functions, queuing services, and small relational and key-value databases. While these services can be useful for building scalable applications, they tend to play a minor role in building a data management system for the cloud due to significant differences in interface, performance, and pricing across cloud providers. Other higher-level services, such as notification services, key management services, and virtual private clouds, are valuable and commonly utilized by cloud-based data management systems.

2.2 Observations on Cloud Computing

Lower Consistency Storage Cloud storage services may provide lower consistency guarantees than hard disks in on-premises data centers. For example, file writes in AWS S3 used to be eventually consistent, meaning that writes to a file in S3 did not become immediately visible to readers. With a product upgrade in 2020, AWS S3 now provides strong read-your-write consistency for updates to a file [11]. However, in our view, it is still advisable to assume lower consistency guarantees and design the service accordingly to ensure it works across all cloud providers, even if some offer improved consistency.

Lower consistency guarantees require a different design for managing files in a cloud-native data management system.
Slow Compute Startup Times Cloud compute services have interesting characteristics as well. The provisioning time for virtual machines in the cloud is relatively long, typically in the order of minutes. This can lead to a poor customer experience if compute clusters are not available instantly. Therefore, it is essential to maintain a pool of already started virtual machines that can be quickly swapped in. The size of this free pool depends on customer demand, and determining how to adaptively size the pool to avoid both over-provisioning and under-provisioning is an interesting optimization problem [14]. Additionally, it is noteworthy that virtual machines rarely fail. Nonetheless, it is advisable to design systems with fault tolerance to handle potential hardware and software machine failures. An advantage of such a design is the support for seamlessly replacing virtual machines with newer ones that have updated software. This capability is a crucial component of the rolling upgrade mechanism required for running a data management service in the cloud with continuous availability.
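The pool-sizing problem can be sketched as a simple control loop. The forecast-based target and the headroom factor below are illustrative assumptions only, not the policy of any particular system.

# Illustrative sketch of a free pool of pre-started VMs (not a production policy).
import random

class FreePool:
    def __init__(self, headroom=1.2, min_size=10):
        self.headroom = headroom      # safety factor on top of forecast demand
        self.min_size = min_size
        self.idle_vms = []            # already booted, waiting to be handed out

    def target_size(self, forecast_demand_per_min, provisioning_minutes=3):
        # Keep enough warm VMs to cover demand for as long as provisioning takes.
        return max(self.min_size,
                   int(forecast_demand_per_min * provisioning_minutes * self.headroom))

    def acquire(self):
        # Hand out a warm VM instantly if available; otherwise fall back to a slow boot.
        return self.idle_vms.pop() if self.idle_vms else boot_new_vm()

    def replenish(self, forecast_demand_per_min):
        while len(self.idle_vms) < self.target_size(forecast_demand_per_min):
            self.idle_vms.append(boot_new_vm())

def boot_new_vm():
    # Placeholder for the (minutes-long) cloud provisioning call.
    return {"id": random.randint(0, 1 << 30), "state": "running"}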
Metadata Management In addition to cloud storage services for managing large volumes of data, there are established technologies for storing metadata, such as dictionary items (e.g., schemas, tables, and views), session data (e.g., session variables and parameters), and runtime state (e.g., running and queued jobs). The requirements for metadata management differ from those for data management. Metadata updates are frequent, involve small amounts of data, and typically include point reads and updates across concurrent users. Transactional key-value stores, such as FoundationDB [15], deployed on cloud-based virtual machines, are well-suited for this use case as they efficiently manage metadata at scale. Existing higher-level services, such as cloud-hosted relational databases or key-value stores, can serve as alternatives to self-managed metadata storage solutions. However, the varying performance characteristics of these higher-level services across cloud providers led us to avoid this approach.
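As an illustration of this usage pattern, the sketch below lays out dictionary items, table versions, and runtime state as small key-value pairs. The key scheme and the dictionary standing in for the transactional store are invented for the example; FoundationDB, for instance, offers comparable transactional primitives.

# Sketch of how dictionary metadata and runtime state might be laid out in
# a transactional key-value store. Key scheme and API are illustrative only.
metadata_store = {}  # stand-in for the transactional key-value store

def put_table_schema(txn, database, table, schema):
    txn[f"dict/{database}/{table}/schema"] = schema

def put_table_version(txn, database, table, version, file_list):
    txn[f"table_version/{database}/{table}/{version:020d}"] = file_list

def put_job_state(txn, job_id, state):
    txn[f"runtime/jobs/{job_id}"] = state

# A single "transaction" updates several small items atomically: the point
# reads and writes are tiny compared to the data files they describe.
txn = metadata_store
put_table_schema(txn, "db1", "users", {"ID": "NUMBER", "Name": "VARCHAR"})
put_table_version(txn, "db1", "users", 1, ["file1", "file2", "file3"])
put_job_state(txn, "job-42", "QUEUED")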
3 Building a Data Management System for the Cloud

Given the rapid growth in data sizes, the need for higher-level programming languages, and the characteristics of cloud computing discussed in the previous section, the challenge is how to build a data management system optimized for the cloud. Our goal is to design an architecture that can scale to handle large data sizes, offer elasticity for customers to easily adjust resources, and efficiently process complex queries over large datasets.

We assume this system is built on a cloud platform that offers two essential services: a storage service for reliable file storage and retrieval, and a compute service that enables the provisioning of compute nodes with predefined hardware configurations. Additionally, we assume access to a scalable, consistent data store, such as a distributed key-value store, which supports the transactional storage and updating of metadata.

Figure 1 illustrates a reference architecture for a data management system in the cloud. The architecture consists of three layers: a control plane layer, a compute plane layer, and a storage layer. The control plane is responsible for the overall service operation, resource management, and user interaction. It contains a network load balancer (NLB), components for authenticating and authorizing requests (Auth), management logic for provisioning cloud resources (Infra), and a metadata store that keeps dictionary metadata and runtime state. The compute plane is responsible for executing user requests such as SQL queries or Python programs. It consists of clusters of compute nodes that can be brought online rapidly. Compute clusters may vary in size and may be dedicated to or shared between in-
Fig. 3 Runtime pruning by pushing down filters and TopK boundaries into the scan operator (a) and sharing a summary of results between scan operators (b)

…a few hundred files. In these cases, pruning is a particularly effective tool for query optimization.

Pruning can occur both at query compile time, as shown in the example above, and at query runtime. Runtime pruning is possible when intermediate results limit the number of files that need to be accessed in the next step of query execution. There are several types of runtime pruning, including filter pushdown pruning, TopK pruning, and join pruning, which we illustrate below.

Filter Pushdown Pruning A standard query execution technique is to push down query predicates into a table scan operator, as illustrated in Fig. 3a. We can leverage these pushed-down filters to prune more data at query runtime. At compile time, pruning removes files from the list of files to scan, as described above. However, files can contain richer metadata than what is available at compile time. Files can also be internally sliced in a way that allows pruning of sub-file units, as for example in the Parquet file format. Files in Parquet format may be internally sliced into “row groups”, the columns of which can be further sliced into “pages”. A Parquet file footer may contain min-max metadata for both row groups and pages. This metadata may not be available at compile time. Therefore, at runtime, we can prune sub-file units, significantly improving execution efficiency.
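As an illustration, the sketch below uses the pyarrow library to skip Parquet row groups whose footer statistics cannot satisfy a pushed-down equality predicate. The file path and column name are placeholders, and a real engine would cache the footer metadata rather than re-read it for every query.

# Sketch: skip Parquet row groups whose min-max statistics exclude a
# pushed-down equality predicate. File and column names are placeholders.
import pyarrow.parquet as pq

def row_group_can_match(rg_meta, column, value):
    # Return False only if footer statistics prove the predicate cannot match.
    for c in range(rg_meta.num_columns):
        col_meta = rg_meta.column(c)
        if col_meta.path_in_schema == column:
            stats = col_meta.statistics
            if stats is not None and stats.has_min_max:
                return stats.min <= value <= stats.max
    return True  # no usable statistics: must scan

def scan_with_pruning(path, column, value):
    pf = pq.ParquetFile(path)
    kept = [rg for rg in range(pf.metadata.num_row_groups)
            if row_group_can_match(pf.metadata.row_group(rg), column, value)]
    # Only the surviving row groups are fetched and decoded.
    return [pf.read_row_group(rg) for rg in kept]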
TopK Pruning A TopK operator usually keeps a heap of the Top-K elements. Let us call the last element in that heap the “current boundary”. The operator will not accept any element lower than this boundary if the order is descending, or greater than this boundary if the order is ascending. Hence, as also illustrated in Fig. 3a, the TopK operator can continuously push down the current boundary to the table scan operator, which leverages it to filter more files and rows. We observed that TopK pruning can improve query runtime by up to two orders of magnitude [19].
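A minimal sketch of this idea for an ascending Top-K is shown below; it assumes each file carries a min-max index on the ordering key and is not meant to reflect Snowflake's actual implementation.

# Sketch of TopK boundary pushdown for an ascending ORDER BY ... LIMIT k.
# The scan consults the TopK operator's current boundary to skip whole files.
import heapq, itertools

class TopK:
    def __init__(self, k):
        self.k = k
        self.heap = []                       # max-heap via negated keys
        self._tie = itertools.count()        # avoids comparing rows on key ties

    def boundary(self):
        # Largest key currently kept once the heap is full ("current boundary").
        return -self.heap[0][0] if len(self.heap) == self.k else None

    def offer(self, key, row):
        b = self.boundary()
        if b is None:
            heapq.heappush(self.heap, (-key, next(self._tie), row))
        elif key < b:
            heapq.heapreplace(self.heap, (-key, next(self._tie), row))

def scan(files, topk):
    for f in files:
        b = topk.boundary()
        # Runtime pruning: skip the whole file if even its smallest key is
        # no better than the current boundary.
        if b is not None and f["min_key"] >= b:
            continue
        for key, row in f["rows"]:
            topk.offer(key, row)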
Join Pruning When joining two tables, the execution engine can send a summary of the distinct values from the build side to the probe side and use that summary to prune the files and rows of the probe table. This is illustrated in Fig. 3b. This technique is generalizable to N-table joins. For example, if a join of two tables results in a small intermediate result, joining this result with a third table can potentially allow pruning more files from the third table’s list based on the summary from the intermediate result.

The above technique can be compared to semi-join reduction [20], which, at the most abstract level, involves taking condensed information from one side of the join and using it to reduce the amount of data that must be exchanged from the other side to compute the result. This can be implemented in several, often complementary, ways: through a proper semi-join (sending the entire join key column), via a Bloom filter [21, 22], or through min-max-based dynamic pruning. Snowflake employs a combination of Bloom filters for tuple-based filtering and join pruning for micro-partition-based pruning.
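The sketch below conveys the flavor of join pruning with a min-max summary plus a distinct-value set standing in for a Bloom filter; the file metadata layout is invented for the example.

# Sketch of join pruning: summarize the join keys seen on the build side and
# use the summary to drop probe-side files before scanning them. A Bloom
# filter could be used instead of (or in addition to) the distinct-value set.
def build_side_summary(build_rows, key):
    keys = [r[key] for r in build_rows]
    return {"min": min(keys), "max": max(keys), "distinct": set(keys)}

def prune_probe_files(probe_files, summary, key):
    kept = []
    for f in probe_files:
        lo, hi = f["min_max"][key]          # per-file min-max index on the join key
        if hi < summary["min"] or lo > summary["max"]:
            continue                        # no build key can fall into this file
        kept.append(f)
    return kept

def prune_probe_rows(rows, summary, key):
    # Row-level filtering with the distinct-value summary (or a Bloom filter).
    return [r for r in rows if r[key] in summary["distinct"]]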
3.1.2 Optimizing Data Layouts

The effectiveness of pruning depends heavily on customer workloads and corresponding data layouts. The query compiler can effectively utilize pruning only when the data layout aligns with the typical customer workload. Therefore, an optimization technique is to organize data layouts based on user workloads. This technique is particularly effective for data management systems in the cloud because necessary compute clusters can be brought online in addition to customer-utilized compute clusters, avoiding interference with customer workloads. Changes to data layouts can be committed optimistically (perhaps with low priority), further avoiding interference with customer workloads. Shifting maintenance work to background compute clusters is a powerful advantage of data management systems in the cloud.

In the above example, querying the data in Fig. 2a by ID was effective because the data was sorted by ID. The query compiler could optimally prune the list of files. However, querying the same data in Fig. 2a by Name would not result in optimal pruning. If the query compiler tried to prune the list of files for the query SELECT * FROM users WHERE Name = ‘Kasper’, it would not be able to prune any files. The name “Kasper” falls within the min-max index ranges of all three files. Hence, the list of files that the compute cluster must retrieve from the storage layer consists of File 1, File 2, and File 3.

If the customer workload typically involves filtering data by Name, it makes sense to organize the table’s data layout differently, sorted by Name. Such a data layout is shown in Fig. 2b. If the query compiler tries to prune the list of files (i.e., File 4, File 5, File 6) for the query SELECT * FROM users WHERE Name = ‘Kasper’, it would be able to prune files 4 and 6 because max(Name) of File 4 is smaller than ‘Kasper’ and min(Name) of File 6 is larger than ‘Kasper’. This demonstrates that optimizing the data layout based on customer workload is an effective strategy.
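The compile-time pruning step used in this example can be sketched in a few lines; the per-file min-max values below are invented for illustration and do not reproduce Fig. 2b.

# Sketch of compile-time pruning with per-file min-max indexes, mirroring the
# example above. The metadata values are illustrative placeholders.
FILES = {
    # file id: (min, max) of the Name column kept in the metadata store
    "File 4": ("Anna", "Judy"),
    "File 5": ("Karl", "Nina"),
    "File 6": ("Olaf", "Zoe"),
}

def prune_by_name(files, value):
    kept = []
    for file_id, (name_min, name_max) in files.items():
        # Keep the file only if the predicate value can lie inside its range.
        if name_min <= value <= name_max:
            kept.append(file_id)
    return kept

# SELECT * FROM users WHERE Name = 'Kasper'
print(prune_by_name(FILES, "Kasper"))   # only File 5 survives pruning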
In practice, there are two main challenges:

1. Automatically determining a typical customer workload is difficult. This makes it challenging to decide which tables to optimize and by which columns or expressions.
2. Sorting large tables is computationally expensive. This makes it challenging to find a cost-effective way to achieve a “good enough” data layout and to define a metric for what “good enough” means.

While systems are increasingly taking steps to automatically understand and characterize customer workloads, it is still common for customers to define the tables and columns for which they want to optimize data layouts. For example, the Snowflake data management system allows customers to define a “clustering key”. A clustering key in Snowflake is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the data in the table within the same files [23].
Sorting large datasets is time-consuming, making it practically infeasible for large production tables. To still achieve an efficient data layout, a technique called clustering is used. Clustering maintains an approximate sort order based on the columns (or expressions) defined by the clustering key. Unlike full sorting, clustering does not require a perfect order but strives to optimize overall query performance. The metric used to model query performance is called clustering depth. Clustering depth is the average depth (1 or greater) of the overlapping micro-partitions with respect to the clustering key [24]. Micro-partitions are analogous to files for this discussion. The fewer overlapping files based on min-max indexes on the customer-specified clustering key, the better the query performance when filtering data by the respective key.

To illustrate the clustering depth metric, Fig. 4 plots min-max ranges for the Name column for files 1 to 3 (Fig. 4a) and files 4 to 6 (Fig. 4b). Given the predicate Name = ‘Kasper’, the number of overlapping files at this point on the Name axis equals three in Fig. 4a and one in Fig. 4b. The average depth of the overlapping files in Fig. 4b is smaller than the average depth in Fig. 4a. Therefore, query performance when filtering data by Name is expected to be better for the clustered data layout in Fig. 4b.
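One simplified way to compute such a depth from per-file min-max ranges is sketched below; the exact metric used in production systems may differ [24], and the ranges are invented to mimic Fig. 4.

# Sketch: approximate the clustering depth of a set of files from their
# min-max ranges on the clustering key. This is one possible formalization.
def overlap_depth_at(point, ranges):
    # Number of files whose min-max range contains the point.
    return sum(1 for lo, hi in ranges if lo <= point <= hi)

def clustering_depth(ranges):
    # Average the depth over every range boundary (a sweep over the sorted
    # boundaries would compute the same more efficiently). Depth is >= 1.
    points = sorted({p for lo, hi in ranges for p in (lo, hi)})
    depths = [overlap_depth_at(p, ranges) for p in points]
    return sum(depths) / len(depths)

# Ranges invented to mimic Fig. 4: unclustered files overlap heavily,
# re-clustered files barely overlap.
unclustered = [("Anna", "Zoe"), ("Bert", "Uwe"), ("Carl", "Tina")]
clustered = [("Anna", "Judy"), ("Karl", "Nina"), ("Olaf", "Zoe")]
print(clustering_depth(unclustered))   # 2.0: overlapping ranges, poor pruning
print(clustering_depth(clustered))     # 1.0: disjoint ranges, effective pruning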
The process to maintain a low clustering depth of a table can be automated and shifted to the background to avoid impacting customer foreground workloads. The background process, running on a separate compute cluster from the customer compute clusters, continuously monitors the clustering depth of a table that has a clustering key defined. For regions of the table that exceed a specific, system-defined threshold, the background process sorts the respective data files along the clustering axis. The process then replaces the old files with the new files for each region. This replacement can be executed as a low-priority, optimistic commit that succeeds if it does not impact concurrently running customer transactions. Table regions can be processed independently in the background, completely transparent to the customer. Figure 5 illustrates this process.
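The background clustering loop described above can be sketched as follows. The metadata, storage, and transaction-manager helpers are hypothetical placeholders, the threshold is illustrative, and the sketch reuses the clustering_depth function from the previous example.

# Sketch of the background clustering loop. Helper objects and thresholds
# are illustrative placeholders, not a real system's interfaces.
TARGET_DEPTH = 2.0

def recluster_table(table, metadata, storage, txn_mgr):
    for region in metadata.regions(table):                  # independent table regions
        ranges = metadata.min_max_ranges(table, region)
        if clustering_depth(ranges) <= TARGET_DEPTH:
            continue                                         # region is healthy
        old_files = metadata.files(table, region)
        rows = storage.read_rows(old_files)
        new_files = storage.write_sorted(rows, key=table.clustering_key)
        # Low-priority, optimistic commit: give up if it would conflict with
        # a concurrently running customer transaction.
        committed = txn_mgr.try_commit_swap(table, old_files, new_files,
                                            priority="low")
        if not committed:
            storage.delete(new_files)                        # retry this region later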
In summary, the goal of clustering is to achieve predictable query performance by limiting the number of files that compute clusters need to retrieve from the storage layer. The chosen metric for optimizing the data layout of a table is clustering depth, which measures the number of overlapping min-max ranges along the customer-defined clustering key. In our experience, this method works well in practice; however, it remains an open research question whether clustering depth is the best metric for modeling query performance and pruning efficiency.

3.1.3 Materialization and Caching

Efficient query execution involves reusing previously completed work. This includes reusing previously computed results, utilizing materialized indexes, and leveraging caches wherever possible. Materialization and caching are especially effective in a multi-tenant, multi-user data management system in the cloud, as computed results, indexes, and cached data can be shared across users and, in cases of data sharing, sometimes across customers as well. Naturally, reusing cached or materialized data must comply with defined access control policies and undergo the system’s standard authorization checks. For instance, a cached result originally computed by user A should be visible to user B only if user B has the same or higher privileges on that data as user A. However, these security aspects are beyond the scope of this paper.
One of the most effective ways to reuse work is by leveraging previously computed results. When the data management system receives a query that was executed before and meets certain conditions (e.g., read-only data access, no changes to the underlying data since the previous execution, and proper access rights), the system can return the previously computed result without needing to compile and execute the query again. This requires persisting results in files within the storage layer so they can be efficiently retrieved for later reuse; storing results as files in the storage layer also facilitates fault-tolerant streaming of query results to a client. Result reuse, therefore, takes advantage of already persisted query outcomes. The key challenges of result reuse are twofold: constructing succinct query fingerprints (e.g., hash values) used to compare old and new queries in a robust (with regard to collisions) and secure way; and providing a fast, online discoverability mechanism for matching cached results (e.g., a metadata index by query fingerprint).
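A sketch of one possible fingerprinting scheme is shown below. The normalization, the fields that feed the hash, and the in-memory result index are illustrative choices, not the scheme of any particular system.

# Sketch of a query fingerprint for result reuse. Fields and normalization
# are illustrative; a real system applies its full authorization checks.
import hashlib, json

def query_fingerprint(sql_text, table_versions, session_settings, role):
    payload = json.dumps({
        "sql": " ".join(sql_text.split()).lower(),   # crude normalization
        "tables": sorted(table_versions.items()),    # data must be unchanged
        "settings": sorted(session_settings.items()),
        "role": role,                                # reuse only under equal privileges
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

result_index = {}   # fingerprint -> location of the persisted result file

def lookup_or_register(fingerprint, result_location=None):
    if fingerprint in result_index:
        return result_index[fingerprint]              # reuse the persisted result
    if result_location is not None:
        result_index[fingerprint] = result_location   # register after execution
    return None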
Caches help reuse operations across all parts of a data management system. Examples include file caches on compute clusters, min-max index caches in the query compiler, and caches for dictionary metadata. File caching on compute clusters involves storing files downloaded from the storage layer on the SSDs of the machines on the compute cluster. Since files are immutable, cache coherence is straightforward: if the file needed by the machine exists on the SSD, the machine can read the file from the SSD, as it is always up-to-date. The cache in the query compiler stores the list of min-max indexes in main memory for frequently accessed tables. This may require large amounts of memory, so only min-max indexes of the most frequently used tables are cached. Warming the min-max index cache can help avoid customer-noticeable slowdowns in query compilation when new machines are swapped into the system.
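A minimal sketch of such a file cache is shown below; the storage wrapper and its download method are hypothetical, and eviction is omitted.

# Sketch of a file cache on a compute node's local SSD. Because data files are
# immutable, a file that exists locally is always up to date; the only policy
# decision left is eviction. Paths and the storage wrapper are placeholders.
import os, shutil

class LocalFileCache:
    def __init__(self, cache_dir, storage):
        self.cache_dir = cache_dir
        self.storage = storage            # hypothetical wrapper around cloud storage
        os.makedirs(cache_dir, exist_ok=True)

    def open(self, file_name):
        local_path = os.path.join(self.cache_dir, file_name)
        if not os.path.exists(local_path):            # miss: download once
            with self.storage.download(file_name) as src, open(local_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
        return open(local_path, "rb")                 # hit: no coherence check needed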
3.2 Efficient DML Execution

The efficient implementation of data modification statements (DMLs), such as SQL’s INSERT, MERGE, UPDATE, and DELETE, generally requires thorough integration into the database system’s design. Cloud-based data management systems are no exception, and their DML support is closely coupled with other design decisions, such as the use of immutable files.

Systems built on immutable files handle deletions via a copy-on-write mechanism. When a row is deleted from a file, the remaining rows are copied into a new file. Updates are treated as a combination of deletes and inserts. While this approach works well for bulk operations, it can create unnecessary overhead for scattered deletions. Specifically, if only a few rows are deleted from a file, the majority of the file still needs to be rewritten, leading to additional compute time and increased storage overhead. To mitigate this and improve the performance of DML operations, some systems use “merge-on-read”, a technique that employs an optional auxiliary file for each data file. The auxiliary file stores the row IDs of deleted rows in a format referred to as deletion vectors [12, 13]. Subsequent updates to the same data file then apply the copy-on-write process using the smaller auxiliary file, which is updated accordingly.
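The sketch below illustrates merge-on-read with in-memory stand-ins for the data file and its deletion vector; the on-disk encoding of deletion vectors [12, 13] is not modeled.

# Sketch of merge-on-read: each data file may have an auxiliary file listing
# deleted row IDs (a deletion vector). In-memory structures stand in for the
# immutable data files and auxiliary files described above.
def delete_rows(deletion_vector, row_ids):
    # A scattered DELETE only appends to the small auxiliary file.
    return deletion_vector | set(row_ids)

def read_file(data_rows, deletion_vector):
    # Readers merge on read: skip rows whose IDs appear in the deletion vector.
    return [row for row_id, row in enumerate(data_rows)
            if row_id not in deletion_vector]

data_file = [("1", "Anna"), ("2", "Bert"), ("3", "Carl")]
dv = set()
dv = delete_rows(dv, [1])                 # delete the second row only
print(read_file(data_file, dv))           # [("1", "Anna"), ("3", "Carl")]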
DML statements produce new versions of a table. Table versions consist of listings of the data files and possibly auxiliary files. Multiple versions can then be used for concurrency control using snapshot semantics. When the system receives a query, it assigns it to a specific version of the table, known as the snapshot. Query compilation and execution operate exclusively on this snapshot, without being affected by newer versions of the table. This approach works well for concurrent read-only queries, as they do not conflict with each other. However, in the case of concurrent DML statements, which produce new files and new table versions, a transaction manager must detect and resolve conflicts.

3.2.1 Transaction Handling

Detecting conflicts in concurrently running DML statements or multi-statement transactions involving multiple DML statements is challenging. The simplest scenario is concurrent append-only operations, such as INSERTs that add files to the table without reading any existing files. Append-only statements can run in parallel and be committed without conflicts if primary key constraints are not enforced. However, if concurrent DML statements or multi-statement transactions (both hereafter referred to simply as transactions) involve reading files from the same table in addition to writing, a transaction manager must detect conflicts and allow only non-conflicting transactions to be committed. Aborted transactions can be transparently recompiled and re-executed for the user until a successful commit occurs or a system-defined retry limit is reached. If the retry limit is exceeded, the abort is returned to the user as an error.
The transaction manager can run either as a centralized component within the control plane layer of the data management system or as distributed logic across compute clusters, relying on a centralized transactional runtime state in the metadata store. Due to the use of immutable files, the transaction manager detects conflicts at the file level rather than the row level. This reduces the amount of state the transaction manager needs to maintain per transaction (at the cost of detecting some false-positive conflicts).
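A simplified sketch of file-level conflict detection for optimistic commits is shown below; the bookkeeping is reduced to the sets of files each transaction read and removed, and the data structures are illustrative.

# Sketch of file-level conflict detection. A transaction records which files
# it read and which it removes; at commit time it conflicts with any commit
# that already removed one of those files after the transaction's snapshot.
class TransactionManager:
    def __init__(self):
        self.current_version = 0
        self.removed_since = {}            # version -> files removed by that commit

    def try_commit(self, txn):
        for version, removed in self.removed_since.items():
            if version > txn["snapshot_version"] and \
                    removed & (txn["read_files"] | txn["removed_files"]):
                return None                              # abort: recompile and retry
        self.current_version += 1
        self.removed_since[self.current_version] = set(txn["removed_files"])
        return self.current_version                      # new table version persisted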
In general, any transaction conflict avoidance or detection technique, such as pessimistic locking or optimistic locking, can be used. Conflict resolution typically involves recompiling and re-executing the conflicting queries. Ultimately, the transaction manager creates a new table version when committing a transaction, which is persisted in the metadata store. Because writes to the metadata store are themselves transactional, it is possible to persist new versions for multiple tables simultaneously, enabling support for multi-table transactions. Conflict detection is the same for both single-table and multi-table transactions. In this capacity, the transaction manager acts as the consensus component of the data management system.

3.2.2 Background File Maintenance

The compute layer of a cloud-based data management system allows for reserving compute resources to perform maintenance tasks in the background. Shifting maintenance tasks to the background eliminates user-visible performance impacts while optimizing the system. One example is the clustering service mentioned in the previous section. There are other important background services for file maintenance:

File Consolidation Background Service The need for immutable files can result in the creation of many small files, especially in cases like a single-row insert statement (e.g., INSERT INTO users VALUES (6, ’Kasper’)). This insert produces a new data file containing only one row, as shown in Fig. 2a. Having many small files can degrade performance: compute clusters need to retrieve many files from storage, slowing query execution, and query compilers must handle large file metadata, slowing query compilation. To address this, small files must be consolidated into larger files. One option is to perform this consolidation in the foreground, where the DML reads existing files, writes new files with the additional data, and produces a new table version that replaces the old files. However, this slows down DML execution. A better option is to consolidate files in the background. This background service merges small files into larger files that are easier to manage and more efficient to read. The system then swaps the old files for the new ones through a DML operation committed to the table. To minimize disruption, this DML can be given lower priority than customer DMLs, allowing the transaction manager to abort the file consolidation DML if conflicts arise. This approach reduces user-visible performance impacts on foreground DML operations.
File Deletion Background Service When files are deleted from a table, for example during the execution of DELETE or UPDATE statements, the deletion of the physical files from cloud storage can be handled in the background. This allows files to be retained for the duration of the data retention period, enabling features such as time travel. It also reduces the amount of work during the foreground execution of a query. Deleting files is surprisingly challenging in a cloud environment. In cloud-scale data management systems, where the number of files to be deleted is high, deleting files individually can lead to rate-throttling errors from the cloud storage provider due to the high request rate. Therefore, files must be deleted in batches. For example, the AWS S3 API allows the deletion of up to 1,000 files per request [25]. Additionally, parallelizing file deletion requests across multiple compute nodes can help avoid rate-throttling errors. Finally, deletion requests may fail and must be retried. Batching, parallelization, and request retries are best managed by a separate background service that operates independently of foreground query execution.
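A sketch of such a deletion service using the boto3 client is shown below; the bucket name is a placeholder, and parallelization across compute nodes is omitted for brevity.

# Sketch of batched file deletion from S3: at most 1,000 keys per request [25],
# with failed keys collected and retried.
import boto3

def delete_in_batches(bucket, keys, max_attempts=5):
    s3 = boto3.client("s3")
    pending = list(keys)
    for _ in range(max_attempts):
        if not pending:
            return
        retry = []
        for i in range(0, len(pending), 1000):                 # S3 limit per request
            batch = [{"Key": k} for k in pending[i:i + 1000]]
            response = s3.delete_objects(Bucket=bucket,
                                         Delete={"Objects": batch, "Quiet": True})
            retry.extend(e["Key"] for e in response.get("Errors", []))
        pending = retry                                        # retry only failed keys
    raise RuntimeError(f"could not delete {len(pending)} files")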
3.3 Efficient Development and Operation of the System

In the previous sections, we covered the fundamentals of building a cloud data management system and techniques for efficient data processing. This section shifts to the operational side, focusing on key considerations for running and developing a cloud-based data management service. Unlike traditional on-premise databases, cloud systems are offered as a service. Here, the developer also serves as the provider, being responsible for software maintenance, resource management, and critical functions like security, billing, and availability. This model frees customers from managing their own databases, allowing them to focus on their core business. We explore the following three core aspects: Configuration Management, Testing & Monitoring, and Data-Driven Development.
17. Moerkotte G (1998) Small materialized aggregates: a light weight index structure for data warehousing. VLDB
18. Edara P, Pasumansky M (2021) Big metadata: when metadata is big data. VLDB
19. Chilukuri Y, Waack J, Zimmerer A. Say hello to superfast top-K queries in Snowflake. https://www.snowflake.com/en/blog/super-fast-top-k-queries. Accessed 2024-10-02
20. Bernstein PA, Chiu D-MW (1981) Using semi-joins to solve relational queries. J ACM 28(1)
21. Mackert LF, Lohman GM (1986) R* optimizer validation and performance evaluation for local queries. ACM SIGMOD
22. Mullin JK (1990) Optimal semijoins for distributed database systems. IEEE Transactions on Software Engineering
23. Snowflake documentation: clustering keys & clustered tables. https://docs.snowflake.com/en/user-guide/tables-clustering-keys
24. Snowflake documentation: micro-partitions & data clustering. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions
25. Amazon S3 API Reference: DeleteObjects. https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html. Accessed 2024-10-02
26. Yan J, Jin Q, Jain S, Viglas SD, Lee A (2018) Snowtrail: testing with production queries on a cloud database. Workshop on Testing Database Systems
27. Dreseler M (2023) How building an industry DBMS differs from building a research one. Sponsor Talk, CIDR. https://www.cidrdb.org/cidr2023/slides/sponsor-talk-snowflake-slides.pdf
28. Aboulnaga A, Salem K, Soror AA, Minhas UF, Kokosielis P, Kamath S (2009) Deploying database appliances in the cloud. IEEE Data Eng Bull 32(1)
29. Abadi DJ (2009) Data management in the cloud: limitations and opportunities. IEEE Data Eng Bull 32(1)
30. Narasayya VR, Chaudhuri S (2021) Cloud data services: workloads, architectures and multi-tenancy. Found Trends Databases 10(1)
31. Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T (2010) Dremel: interactive analysis of web-scale datasets. VLDB
32. Katahanas L, Talapady CB, Rowe J, Zhang F, Draves R, Friedman M, Filho SMI, Kumar A (2021) The cosmos big data platform at Microsoft: over a decade of progress and a decade to look forward. VLDB
33. Brantner M, Florescu D, Graf D, Kossmann D, Kraska T (2008) Building a database on S3. ACM SIGMOD
34. Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon Redshift and the case for simpler data warehouses. ACM SIGMOD
35. Armenatzoglou N, Basu S, Bhanoori N, Cai M, Chainani N, Chinta K, Govindaraju V, Green TJ, Gupta M, Hillig S et al (2022) Amazon Redshift re-invented. ACM SIGMOD
36. Aguilar-Saborit J, Ramakrishnan R, Srinivasan K, Bocksrocker K, Alagiannis I, Sankara M, Shafiei M, Blakeley J, Dasarathy G, Dash S et al (2020) POLARIS: the distributed SQL engine in Azure Synapse. VLDB
37. Google BigQuery. https://cloud.google.com/bigquery
38. Armbrust M, Ghodsi A, Xin R, Zaharia M (2021) Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. CIDR
39. Databricks SQL. https://www.databricks.com/product/databricks-sql
40. Oracle Autonomous Data Warehouse. https://www.oracle.com/autonomous-database/autonomous-data-warehouse
41. IBM Db2. https://www.ibm.com/db2
42. Teradata. https://www.teradata.com
43. Müller I, Marroquín R, Alonso G (2020) Lambada: interactive data analytics on cold data using serverless cloud infrastructure. ACM SIGMOD
44. Apache Parquet. https://parquet.apache.org
45. Apache Iceberg. https://iceberg.apache.org

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.