24 Distributed OLAP Databases
Intro to Database Systems
15-445/15-645, Fall 2019
Andy Pavlo
Carnegie Mellon University
ADMINISTRIVIA
Homework #5: Monday Dec 3rd @ 11:59pm
Project #4: Monday Dec 10th @ 11:59pm
Extra Credit: Wednesday Dec 10th @ 11:59pm
Final Exam: Monday Dec 9th @ 5:30pm
Systems Potpourri: Wednesday Dec 4th
→ Vote for what system you want me to talk about.
→ https://cmudb.io/f19-systems
ADMINISTRIVIA
Monday Dec 2nd – Oracle Lecture
→ Shasank Chavan (VP In-Memory Databases)
Monday Dec 2nd – Oracle Systems Talk
→ 4:30pm in GHC 6115
→ Pizza will be served.
Tuesday Dec 3rd – Oracle Research Talk
→ Hideaki Kimura (Oracle Beast)
→ 12:00pm in CIC 4th Floor (Panther Hollow Room)
→ Pizza will be served.
LAST CLASS
Atomic Commit Protocols
Replication
Consistency Issues (CAP)
Federated Databases
BIFURCATED ENVIRONMENT
[Figure: OLTP databases feed a single OLAP database through an Extract / Transform / Load (ETL) pipeline.]
DECISION SUPPORT SYSTEMS
Applications that serve the management,
operations, and planning levels of an organization
to help people make decisions about future issues
and problems by analyzing historical data.
Star Schema vs. Snowflake Schema
STAR SCHEMA
[Figure: a central SALES_FACT table connected directly to four dimension tables.]
SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_NAME, CATEGORY_DESC, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
LOCATION_DIM: COUNTRY, STATE_CODE, STATE_NAME, ZIP_CODE, CITY
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_NUM, MONTH_NAME, DAY_OF_MONTH
SNOWFLAKE SCHEMA
[Figure: the same SALES_FACT table, but the dimension tables are normalized into additional lookup tables.]
SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_FK, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
CAT_LOOKUP: CATEGORY_ID, CATEGORY_NAME, CATEGORY_DESC
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
LOCATION_DIM: COUNTRY, STATE_FK, ZIP_CODE, CITY
STATE_LOOKUP: STATE_ID, STATE_CODE, STATE_NAME
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_FK, DAY_OF_MONTH
MONTH_LOOKUP: MONTH_NUM, MONTH_NAME, MONTH_SEASON
STAR VS. SNOWFLAKE SCHEMA
Issue #1: Normalization
→ Snowflake schemas take up less storage space.
→ Denormalized data models may incur integrity and
consistency violations.
Issue #2: Query Complexity
→ Snowflake schemas require more joins to get the data
needed for a query.
→ Queries on star schemas will (usually) be faster.
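To make the trade-off concrete, here is a sketch of the same revenue-by-category query against both schemas, reusing the table and column names from the diagrams above (the join keys are assumptions; the diagrams do not mark them):

-- Star schema: category attributes are denormalized into PRODUCT_DIM,
-- so one join reaches them.
SELECT p.CATEGORY_NAME, SUM(f.PRICE * f.QUANTITY) AS revenue
  FROM SALES_FACT f
  JOIN PRODUCT_DIM p ON f.PRODUCT_FK = p.PRODUCT_CODE
 GROUP BY p.CATEGORY_NAME;

-- Snowflake schema: category attributes live in CAT_LOOKUP,
-- so the same question costs an extra join.
SELECT c.CATEGORY_NAME, SUM(f.PRICE * f.QUANTITY) AS revenue
  FROM SALES_FACT f
  JOIN PRODUCT_DIM p ON f.PRODUCT_FK = p.PRODUCT_CODE
  JOIN CAT_LOOKUP c ON p.CATEGORY_FK = c.CATEGORY_ID
 GROUP BY c.CATEGORY_NAME;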
PROBLEM SETUP
SELECT * FROM R JOIN S
  ON R.id = S.id
[Figure: an application server issues the query against data split across partitions P1-P4 on separate nodes.]
TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems
PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that
contains the data.
→ Perform as much filtering and processing as possible
where data resides before transmitting over network.
Approach #2: Pull Data to Query
→ Bring the data to the node that is executing a query that
needs it for processing.
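For example, given the query from the problem setup and a node that owns IDs 101-200, "pushing" the query to data means shipping a rewritten fragment like the one below, so that only join results (not base tuples) cross the network:

SELECT * FROM R JOIN S
    ON R.id = S.id
 WHERE R.id BETWEEN 101 AND 200;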
PUSH QUERY TO DATA
SELECT * FROM R JOIN S
  ON R.id = S.id
[Figure: the query goes to the node holding P1 (IDs 1-100); the node holding P2 (IDs 101-200) joins its local data and ships only its join output, and the combined R ⨝ S result is returned to the application server.]
PULL DATA TO QUERY
SELECT * FROM R JOIN S
  ON R.id = S.id
[Figure: the node executing the query (P1, IDs 1-100) pulls pages (e.g., Page ABC, Page XYZ) for IDs 101-200 from the other node's storage, performs the join locally, and returns R ⨝ S to the application server.]
OBSERVATION
The data that a node receives from remote sources is cached in the buffer pool.
→ This allows the DBMS to support intermediate results that are larger than the amount of memory available.
→ Ephemeral pages are not persisted after a restart.
What happens to a long-running OLAP query if a node crashes during execution?
QUERY FAULT TOLERANCE
Most shared-nothing distributed OLAP DBMSs
are designed to assume that nodes do not fail
during query execution.
→ If one node fails during query execution, then the whole
query fails.
The DBMS could take a snapshot of the
intermediate results for a query during execution
to allow it to recover if nodes fail.
QUERY FAULT TOLERANCE
SELECT * FROM R JOIN S
  ON R.id = S.id
[Figure: the node executing the join writes a snapshot of the intermediate R ⨝ S result to storage; after a crash, the query resumes from the snapshot instead of restarting from scratch.]
QUERY PLANNING
All the optimizations that we talked about before
are still applicable in a distributed environment.
→ Predicate Pushdown
→ Early Projections
→ Optimal Join Orderings
Distributed query optimization is even harder
because it must consider the location of data in the
cluster and data movement costs.
QUERY PLAN FRAGMENTS
Approach #1: Physical Operators
→ Generate a single query plan and then break it up into
partition-specific fragments.
→ Most systems implement this approach.
Approach #2: SQL
→ Rewrite original query into partition-specific queries.
→ Allows for local optimization at each node.
→ MemSQL is the only system that I know of that does this.
QUERY PLAN FRAGMENTS
Original query:
SELECT * FROM R JOIN S
  ON R.id = S.id
Partition-specific fragments:
SELECT * FROM R JOIN S ON R.id = S.id
 WHERE R.id BETWEEN 1 AND 100     -- node with Id:1-100
SELECT * FROM R JOIN S ON R.id = S.id
 WHERE R.id BETWEEN 101 AND 200   -- node with Id:101-200
SELECT * FROM R JOIN S ON R.id = S.id
 WHERE R.id BETWEEN 201 AND 300   -- node with Id:201-300
Union the output of each join to produce the final result.
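A minimal sketch of that final step, assuming the fragment outputs are staged in hypothetical tables fragment_1 through fragment_3 on the coordinating node (UNION ALL suffices because the ID ranges are disjoint):

SELECT * FROM fragment_1
UNION ALL
SELECT * FROM fragment_2
UNION ALL
SELECT * FROM fragment_3;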
OBSERVATION
The efficiency of a distributed join depends on the
target tables' partitioning schemes.
One approach is to put entire tables on a single
node and then perform the join.
→ You lose the parallelism of a distributed DBMS.
→ Costly data transfer over the network.
DISTRIBUTED JOIN ALGORITHMS
To join tables R and S, the DBMS needs to get the
proper tuples on the same node.
Once there, it then executes the same join
algorithms that we discussed earlier in the
semester.
SCENARIO #1
SELECT * FROM R JOIN S
  ON R.id = S.id
One table is replicated at every node. Each node joins its local data and then sends its results to a coordinating node.
[Figure: R is partitioned on Id (1-100 on one node, 101-200 on the other) while S is replicated on every node; each node computes its partial R ⨝ S and the coordinator combines them.]
SCENARIO #2
SELECT * FROM R JOIN S
  ON R.id = S.id
Tables are partitioned on the join attribute. Each node performs the join on local data and then sends its results to a node for coalescing.
[Figure: R and S are both partitioned on Id (1-100 on one node, 101-200 on the other); each node computes its partial R ⨝ S and the results are combined.]
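Reaching this scenario is a physical-design choice. A PostgreSQL-style sketch of declaring both tables hash-partitioned on the join attribute (illustrative only; in a distributed DBMS the partitions would be placed on different nodes):

CREATE TABLE R (id INT, name TEXT) PARTITION BY HASH (id);
CREATE TABLE R_p0 PARTITION OF R FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE R_p1 PARTITION OF R FOR VALUES WITH (MODULUS 2, REMAINDER 1);

CREATE TABLE S (id INT, val INT) PARTITION BY HASH (id);
CREATE TABLE S_p0 PARTITION OF S FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE S_p1 PARTITION OF S FOR VALUES WITH (MODULUS 2, REMAINDER 1);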
SCENARIO #3
SELECT * FROM R JOIN S
  ON R.id = S.id
Both tables are partitioned on different keys. If one of the tables is small, then the DBMS broadcasts that table to all nodes.
[Figure: R is partitioned on Id and S is partitioned on Val; the DBMS broadcasts a copy of S to every node, each node computes its partial R ⨝ S, and the results are combined.]
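Some engines let you ask for the broadcast strategy explicitly. A Spark-SQL-style hint (shown purely as an illustration, not something from the lecture):

SELECT /*+ BROADCAST(S) */ *
  FROM R JOIN S ON R.id = S.id;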
SCENARIO #4
SELECT * FROM R JOIN S
  ON R.id = S.id
Neither table is partitioned on the join key. The DBMS copies the tables by reshuffling them across nodes.
[Figure: R is partitioned on Name and S on Val; the DBMS reshuffles both tables across the nodes on Id (1-100 to one node, 101-200 to the other), then each node computes its partial R ⨝ S and the results are combined.]
SEMI-JOIN
Join operator where the result only contains columns from the left table.
Distributed DBMSs use semi-joins to minimize the amount of data sent during joins.
→ This is like a projection pushdown.
Some DBMSs support SEMI JOIN SQL syntax. Otherwise you fake it with EXISTS:

SELECT R.id FROM R
  LEFT OUTER JOIN S
    ON R.id = S.id
 WHERE S.id IS NOT NULL

SELECT R.id FROM R
 WHERE EXISTS (
   SELECT 1 FROM S
    WHERE R.id = S.id)

[Figure: only the R.id column is shipped between the nodes holding R and S.]
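Where explicit syntax exists, it reads like this HiveQL-style example (illustrative; keywords and restrictions vary by system, and Hive only allows left-table columns in the SELECT list):

SELECT R.id FROM R
  LEFT SEMI JOIN S ON R.id = S.id;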
RELATIONAL ALGEBRA: SEMI-JOIN
Like a natural join except that the attributes on the right table that are not used to compute the join are restricted.
Syntax: (R ⋉ S)
R(a_id, b_id, xxx):
  (a1, 101, X1)
  (a2, 102, X2)
  (a3, 103, X3)
S(a_id, b_id, yyy):
  (a3, 103, Y1)
  (a4, 104, Y2)
  (a5, 105, Y3)
(R ⋉ S):
  (a3, 103, X3)
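Equivalently, the semi-join is a projection of the natural join onto R's attributes (a standard identity, stated here for completeness): R ⋉ S = π_attr(R)(R ⨝ S).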
CLOUD SYSTEMS
Vendors provide database-as-a-service (DBaaS)
offerings that are managed DBMS environments.
Newer systems are starting to blur the lines
between shared-nothing and shared-disk.
→ Example: You can do simple filtering on Amazon S3
before copying data to compute nodes.
CLOUD SYSTEMS
Approach #1: Managed DBMSs
→ No significant modification to the DBMS to be "aware"
that it is running in a cloud environment.
→ Examples: Most vendors
Approach #2: Cloud-Native DBMS
→ The system is designed explicitly to run in a cloud
environment.
→ Usually based on a shared-disk architecture.
→ Examples: Snowflake, Google BigQuery, Amazon
Redshift, Microsoft SQL Azure
SERVERLESS DATABASES
Rather than always maintaining compute resources for each customer, a "serverless" DBMS evicts tenants when they become idle.
[Figure: while the tenant is active, a compute node holds its buffer pool and page table; when the application goes idle, that state is flushed to shared storage and the node is released; when the tenant returns, a node reloads the state from storage and resumes.]
DISAGGREGATED COMPONENTS
System Catalogs
→ HCatalog, Google Data Catalog, Amazon Glue Data
Catalog
Node Management
→ Kubernetes, Apache YARN, Cloud Vendor Tools
Query Optimizers
→ Greenplum Orca, Apache Calcite
UNIVERSAL FORMATS
Most DBMSs use a proprietary on-disk binary file
format for their databases.
→ Think of the BusTub page types…
The only way to share data between systems is to
convert data into a common text-based format.
→ Examples: CSV, JSON, XML
There are new open-source binary file formats
that make it easier to access data across systems.
UNIVERSAL FORMATS
Apache Parquet
→ Compressed columnar storage from Cloudera/Twitter.
Apache ORC
→ Compressed columnar storage from Apache Hive.
Apache CarbonData
→ Compressed columnar storage with indexes from Huawei.
Apache Iceberg
→ Flexible data format that supports schema evolution, from Netflix.
HDF5
→ Multi-dimensional arrays for scientific workloads.
Apache Arrow
→ In-memory compressed columnar storage from Pandas/Dremio.
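Many engines can query these files in place. As an illustration (DuckDB-style SQL; the file name is hypothetical):

SELECT COUNT(*) FROM read_parquet('sales.parquet');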
CONCLUSION
More money, more data, more problems…
Cloud OLAP Vendors: [logos]
On-Premise OLAP Systems: [logos]
NEXT CLASS
Oracle Guest Speaker