Centralized versus distributed DBMS
Centralized
Processor
Distributed Databases
Memory Disk … Disk
CPS 216 Distributed
Processor Processor
Advanced Database Systems
…
Memory Disk Memory Disk
Disk
Disk Disk
Disk
2
Parallel versus distributed DBMS Distributed DBMS issues
• Parallel DBMS • Database management with multiple sites that are
– Fast interconnect possibly autonomous and heterogeneous
– Homogeneous hardware/software – Data organization
– Total control over components – Query processing and optimization
• Distributed DBMS – Concurrency control and recovery
– Geographically distributed
• Disconnected operations possible
– Heterogeneous hardware/software
• Performance, data formats, data processing capabilities
– Autonomy of individual sites 3 4
Data organization Partitioning schemes
A1 A2 A3 A4
• Horizontal t1 …
• Top-down approach t2 … Site 1
t3 … Site 2
– Have a database t4 …
………… … … …
– How to partition and/or replicate it across sites … Site k
…
• Bottom-up approach A1 A2 A3 A4
– Have existing databases at different sites • Vertical t1 …
…
t2
– How to integrate them together and deal with t3 …
t4 …
heterogeneity and autonomy ………… … … …
…
…
• Focus for today
– Data partitioning using a top-down approach • Or hybrid Site 1 Site 2 Site k
5 6
1
Horizontal partitioning schemes Properties of a correct partitioning
• Round-robin partitioning R → { R1, R2, …, Rk }
• Hash partitioning
• Range partitioning • Completeness and reconstructability
• Predicate-based partitioning R = R1 ∪ R2 ∪ … ∪ Rk
• Derived horizontal partitioning
• Disjointness
Ri ∩ Rj = Ø for any i ≠ j
7 8
Round-robin partitioning Hash partitioning
R R0 R1 R2 R R0 R1 R2
t1 t1 t1 hash(k1) = 2 t1
t2 t2 t2 hash(k2) = 0 t2
t3 t3 t3 hash(k3) = 0 t3
t4 t4 t4 hash(k4) = 1 t4
… …
• Evenly distributes data • Evenly distributes data (assuming a good hash function)
• Good for full relation scans • Good for point queries and equijoins on the partitioning
• Not good for range queries attribute
9 • Not good for range queries 10
Range partitioning Predicate-based partitioning
R partitioning vector: <4, 7> R0 R1 R2
t1 k1 = 5 t1
• Fragmentation
t2 k2 = 8 t2
– Decide how to divide a relation horizontally into
t3 k3 = 2 t3
fragments using a set of predicates
t4 k4 = 3 t4
…
• Allocation
• Good for range queries on the partitioning attribute
– Decide which fragments go to which site
• The choice of partitioning vector is important
– Bad vector may result in both data skew and execution skew
11 12
2
Predicate-based fragmentation Example
• Given a relation R and a set of simple predicates • Say queries use simple predicates:
P = { p1, p2, …, pn } A < 10, A > 5, D = ’CS’, D = ’EE’
• Generate minterm predicates • Generate, simplify, and eliminate minterms
A < 10 ∧ A > 5 ∧ D = ’CS’ ∧ D = ’EE’ eliminated
– M = { m | m = ∧ (1 ≤ k ≤ n) pk* }, where pk* is either pk
or ¬pk A < 10 ∧ A ≤ 5 ∧ D = ’CS’ ∧ D ≠ ’EE’ A ≤ 5 ∧ D = ’CS’
…
– Simplify minterms in M and eliminate useless ones
• Final set of fragments
• For each m in M, generate a fragment σm R
σ5 < A < 10 ∧ D = ’CS’ R σ5 < A < 10 ∧ D = ’EE’ R
σA ≤ 5 ∧ D = ’CS’ R σA ≤ 5 ∧ D = ’EE’ R
σA ≥ 10 ∧ D = ’CS’ R σA ≥ 10 ∧ D = ’EE’ R
13 14
Choice of simple predicates Allocation of fragments
• Completeness • Tough optimization problem
– There is an equal probability of access by every – Do we replicate fragments?
application to any two tuples in the same minterm – Where we place each copy of each fragment?
fragment • Metrics: minimize query response time; maximize
• If p is used in fragmentation, then σpR either accesses all throughput; minimize network traffic; …
tuples in a fragment or none in a fragment
• Constraints: available storage, bandwidth, processing
• Minimality
power; response time requirement; …
– If a predicate causes a fragment f to be further
fragmented into fi and fj, there should at least one • Issues: origin of queries; selectivity of fragments; query
application that accesses fi and fj differently processing strategies; consistency enforcement; …
» Use all relevant predicates in frequent queries!
15 16