Distributed Databases
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
2 Advanced Database system principle
Distributed Database System
A distributed database system:
loosely coupled sites
share no physical component
Database systems that run on each site:
independent of each other
Transactions may access data
over one or several sites
Can be Homogeneous or Heterogeneous
3 Advanced Database system principle
Homogeneous Distributed Databases
In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in processing
user requests.
Each site surrenders its right to change schemas or software
Appears to user as a single system
4 Advanced Database system principle
Heterogeneous Distributed Databases
In a heterogeneous distributed database
Different sites may use different schemas and software
Different schema is a major problem for query processing
Different software is a major problem for transaction processing
Sites may not be aware of each other
5 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
6 Advanced Database system principle
Data Fragmentation
Divide relation r into fragments
r1, r2, …, rn
Fragment contains sufficient information
Horizontal fragmentation: (data sharding)
tuples are assigned to one or more fragments
Vertical fragmentation:
schema of r is split into several smaller schemas
7 Advanced Database system principle
Data Fragmentation
Vertical fragmentation:
All schemas must contain a common candidate key (or superkey)
to ensure lossless join property.
The tuple-id attribute may be added to each schema to serve as a
candidate key.
8 Advanced Database system principle
Horizontal Fragmentation of account Relation
account_number branch_name balance
A-305 Hillside 500
A-226 Hillside 336
A-155 Hillside 62
account1 = branch_name=“Hillside” (account )
account_number branch_name balance
A-177 Valleyview 205
A-402 Valleyview 10000
A-408 Valleyview 1123
A-639 Valleyview 750
account2 = branch_name=“Valleyview” (account )
9 Advanced Database system principle
Vertical Fragmentation of employee_info Relation
branch_name customer_name tuple_id
Hillside Lowman 1
Hillside Camp 2
Valleyview Camp 3
Valleyview Kahn 4
Hillside Kahn 5
Valleyview Kahn 6
Valleyview Green 7
deposit1 = branch_name, customer_name, tuple_id (employee_info )
account_number balance tuple_id
A-305 500 1
A-226 336 2
A-177 205 3
A-402 10000 4
A-155 62 5
A-408 1123 6
A-639 750 7
10 Advanced Database system principle
deposit2 = account_number, balance, tuple_id (employee_info )
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a relation
scalability
Vertical:
allows vertically split a relation, assign attributes to the sites
where they are most frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
11 Advanced Database system principle
Data Replication
Data Replication:
a relation, or fragment of a relation
stored redundantly
in two or more sites.
Full replication of a relation:
the relation is stored at all sites.
Fully redundant databases:
every site contains a copy of the entire database.
12 Advanced Database system principle
Data Replication (Cont.)
Advantages of Replication
Availability:
One site failure, but r is available at other sites
Parallelism: queries on r may be processed by several nodes in
parallel
Reduced data transfer: relation r is available locally at each site
containing a replica of r.
13 Advanced Database system principle
Data Replication (Cont.)
Disadvantages of Replication:
Mainly data consistency:
Increased cost of updates: each replica of relation r must be
updated.
Increased complexity of concurrency control: concurrent updates
may lead to inconsistent data
14 Advanced Database system principle
Data Transparency
Data transparency:
the degree, user may not know the details:
how and where the data items are stored
Several kinds of transparency:
Fragmentation transparency
Replication transparency
Location transparency
15 Advanced Database system principle
Naming Data
locating entity, simplify management, across different sites.
Solution: Naming management, etc.
Naming management:
1. Give a system-wide unique name;
2. It should be possible to find the location of data items
efficiently;
3. It should be possible to change the location of data items
transparently.
16 Advanced Database system principle
Centralized Scheme - Name Server
Structure:
name server assigns all names
each site maintains a record of local data items
sites ask name server to locate non-local data items
Disadvantages:
name server is a potential performance bottleneck
name server is a single point of failure
17 Advanced Database system principle
Use of Aliases
Alternative to centralized scheme:
Add site id: i.e., [Link].
fails to achieve network transparency.
Therefore:
Create a set of aliases for data items;
store the mapping of aliases at each site.
User unaware of the physical location of data
unaffected if the data item is moved to other sites
18 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
19 Advanced Database system principle
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager, for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the
transactions.
20 Advanced Database system principle
Distributed Transactions
Each site has a transaction coordinator, for:
Starting the execution of transactions
Distributing subtransactions at appropriate sites.
Coordinating the termination of each transaction
21 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
22 Advanced Database system principle
System Failure Modes
Unique failures of distributed systems:
Failure of a site.
Loss of messages
Handled by TCP-IP protocols etc.
Failure of a communication link
Handled by routing messages via alternative links
Network partition
if it is split into two or more subsystems, without any connection
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.
23 Advanced Database system principle
Commit Protocols
Used to ensure atomicity across sites
a transaction executed at multiple sites
must be committed, or aborted at all the sites
The two-phase commit (2PC) protocol is widely used
The three-phase commit (3PC) protocol is more complicated ,
expensive, but avoids some drawbacks of two-phase commit protocol.
This protocol is seldom used in practice.
24 Advanced Database system principle
Two Phase Commit Protocol (2PC)
Assumes fail-stop model
failed sites simply stop working
Executed by the coordinator, after the transaction is executed
The protocol involves all the local sites at which the transaction
executed
Let T be a transaction initiated at site Si, and let the transaction
coordinator at Si be Ci
25 Advanced Database system principle
Phase 1: Obtaining a Decision
Coordinator asks all participants to prepare to commit transaction Ti.
Ci adds the records <prepare T> to the log and forces log to
stable storage
sends prepare T messages to all sites (T executed)
26 Advanced Database system principle
Phase 1: Obtaining a Decision
Upon receiving message, transaction manager at site determines if it
can commit the transaction
if not, add a record <no T> to the log and send abort T message
to Ci
if the transaction can be committed, then:
add the record <ready T> to the log
force all records for T to stable storage
send ready T message to Ci
27 Advanced Database system principle
Phase 2: Recording the Decision
If received a ready T from all sites, then
Ci can commit T
else
T must be aborted.
Coordinator:
adds a record to the log: <commit T> or <abort T>
forces record onto stable storage, once on stable storage, it is
irrevocable
sends a message to each participant, informing commit or abort
Participants take appropriate action locally.
28 Advanced Database system principle
Handling of Failures - Site Failure
To recover a site Si ,first check its log,for:
Contains <commit T>,finished,perform redo (T)
Contains <abort T>: not finished,site executes undo (T)
Contains <ready T>: have submitted,requires Ci to decide operations
of transaction
If T committed, redo (T) (if all committed)
If T aborted, then undo (T),(aborted by a site)
If no message about T,then Sk fails at the preparing stage,then Sk must
execute undo (T)
29 Advanced Database system principle
Handling of Failures- Coordinator Failure
C fails, sites decide T’s fate:
If one active site contains <commit T> ,then T must be
committed.
If one active site contains <abort T> ,then T must be aborted.
If a site has no <ready T> ,then must be aborted by that site, then
abort T.
If all contain <ready T>,and no other message like (such as
<abort T> of <commit T>). All have to wait for C to decide.
Blocking problem : active sites may have to wait for failed coordinator
to recover.
30 Advanced Database system principle
Handling of Failures - Network Partition
If the coordinator and all its participants remain in one partition,
the failure has no effect on the commit protocol
If they belong to several partitions:
Participants think that Coordinator fails, then execute recovery
protocol, wait until Coordinator is up;
Coordinator think that participants are down, then execute
corresponding protocols
31 Advanced Database system principle
Recovery and Concurrency Control
In-doubt transactions:
have a <ready T>, not ( <commit T>, or <abort T>)
The recovering site determines transactions by contacting
other sites;
this can slow down and potentially block recovery
Alternatively, recovery algorithms rewrite lock information like:
Instead of <ready T>, write out <ready T, L>, L = list of
locks held by T.
Need acquires grant lock for all locks
After acquiring locks, transactions can be resumed
Execute commit or abort
Fast!
32 Advanced Database system principle
Three Phase Commit (3PC)
Assumptions:
No network partitioning
At any point, at least one site must be up.
At most K sites (participants as well as coordinator) can fail
Phase 1: identical to 2PC Phase 1
ready to commit
In phase 2 coordinator makes a decision (called the pre-commit
decision) and records it in multiple (at least K) sites
In phase 3, coordinator sends commit/abort message to all
participating sites
33 Advanced Database system principle
Three Phase Commit (3PC)
Advantage:
under 3PC, pre-commit decision can be used to commit (despite
failure of some coordinators)
Avoids blocking problem as long as < K sites fail
Disadvantage:
higher overheads
assumptions may not be satisfied in practice
34 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
35 Advanced Database system principle
Concurrency Control
Modify concurrency control schemes for distributed environment
We assume that each site participates in the same commit protocol
to ensure global transaction atomicity.
We assume all replicas of any item are updated
Will see how to relax this in case of site failures later
36 Advanced Database system principle
Single-Lock-Manager Approach
System maintains a single lock manager Si
When a transaction needs to lock a data item:
it sends a lock request to Si
lock manager determines grant immediately or not
If yes, lock manager sends back a message
If no, request is delayed until it can be granted
37 Advanced Database system principle
Single-Lock-Manager Approach (Cont.)
The transaction read the data item from any site storing the replica
Writes must be performed on all replicas of a data item
Advantages of scheme:
Simple implementation
Simple deadlock handling
Disadvantages of scheme are:
Bottleneck: lock manager site
Vulnerable
38 Advanced Database system principle
Distributed Lock Manager
Locking is distributed and implemented at each local site
Lock managers control access to local data items
Advantage:
distributed, robust
Disadvantage:
deadlock detection
Several variants of this approach
Primary copy
Majority protocol
Biased protocol
Quorum consensus 法定人数通过 协议
39 Advanced Database system principle
Primary Copy
Choose one replica as primary copy of a data item
primary site: containing that replica
Different data items --- different primary sites
To lock on a data item Q, requests a lock at the primary site of Q
Implies getting lock on all replicas of Q
Benefit
Concurrency control is simple
Drawback
If the primary site fails, Q is inaccessible even though other sites is
accessible.
40 Advanced Database system principle
Majority Protocol
Local lock manager manages lock and unlock requests for data
items stored at that site
To lock an unreplicated data item Q at site Si, send message to the
lock managers of Si.
41 Advanced Database system principle
Majority Protocol (Cont.)
To lock replicated data
If Q is replicated at n sites, then lock request message must be
sent to at least n/2 sites
Do nothing on Q, until obtain a lock on a majority of the
replicas of Q
When writing, need to write on all replicas.
42 Advanced Database system principle
Majority Protocol (Cont.)
Benefit
Can be used even when some sites are unavailable
Drawback
Requires 2(n/2 + 1) messages for lock requests, (n/2 + 1)
messages for unlock requests.
Might cause deadlock even with single item
e.g., each of 3 transactions has locks on 1/3 of the replicas of a
data.
43 Advanced Database system principle
Biased Protocol
In biased protocol:priority order
shared locks > exclusive locks
Shared locks. to lock data item Q, request a lock on a replica of
Q.
Exclusive locks. to lock Q, request a lock on all replicas of Q.
Advantage - imposes less overhead on read operations.
Disadvantage - additional overhead on writes
44 Advanced Database system principle
Quorum Consensus Protocol
A generalization of majority and biased protocols
Site: Site 1, Site 2, …, Site n
Weight: S1, S2, …, Sn
S = sum (S1,…,Sn)
Choose two values : read quorum Qr , write quorum Qw
Requires: Qr + Qw > S
Qw > S / 2
Quorums is chose separately for each item
To read, must lock replicas:
Slock >= Qr
To write, must lock replicas:
Slock >= Qw
Advantage: control weight & quorum consensus, reduce
read/write cost
Advanced Database system principle
45
Timestamping
A timestamp based concurrency-control protocol:
Each transaction -> a unique timestamp
to determine its serialization order.
Main task: generate timestamp in a distributed way
Locally: generates a unique timestamp
using logical counter or clock
Globally: concatenating
local timestamp with unique identifier
46 Advanced Database system principle
Timestamping (Cont.)
A site with a slow (fast) clock,
will assign larger (smaller) timestamps
To fix this problem
In each site Si , define a logical clock (LCi), generating unique
local timestamp
If a transaction Ti with timestamp < x,y> visit Si, and x > Lci,
then LCi + 1
47 Advanced Database system principle
Deadlock Handling
Consider the following two transactions list: site 1 has item X and
transaction T1 ; site 2 has item Y and transaction T2 :
T1 : write (X) T2 : write (Y)
write (Y) write (X)
X-lock on X
write (X) X-lock on Y
write (Y)
wait for X-lock on X
Wait for X-lock on Y
Result: deadlock which cannot be detected locally at either site
48 Advanced Database system principle
Centralized Approach
A global wait-for graph
constructed and maintained in a single site
called deadlock-detection coordinator
two approaches:
Real graph: real, unknown
Constructed graph: approximation generated
49 Advanced Database system principle
Centralized Approach
To construct:
Add / delete an edge
if inserted/removed in local wait-for graph
Then, coordinator perform cycle-detection
If the coordinator finds a cycle, it selects a victim and notifies all
sites. All sites roll back the victim transaction
50 Advanced Database system principle
Local and Global Wait-For Graphs
Local
Global
51 Advanced Database system principle
Wait-For Graph for False Cycles
Given the right figure,
1. T2 releases resources at S1
message: remove edge T1 T2
2. And then T2 requests a resource from T3
message: insert edge T2 T3
Suppose insert message is before delete message
due to network delays
The coordinator would then find a false cycle
Initial state:
T1 T2 T3 T1
False cycles cannot occur in two-phase protocol
Generally cause less serious problem
Still need to select the victim
52 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
53 Advanced Database system principle
Availability
High availability:
distributed db is available at all the time (99.99%)
Robustness:
ability to function despite system failures
To be robust, a distributed system must
Detect failures
Reconfigure the system to continue computation
Recovery/reintegration after a site/link is repaired
54 Advanced Database system principle
Reconfiguration
Reconfiguration:
Abort all active transactions, because
they may hold locks on other sites, other transactions might not
work
only some replicas failed, but transactions continue to visit data in
failed site
55 Advanced Database system principle
Reconfiguration (Cont)
Reconfiguration:
If replica were at failed site, update system catalog to remove
them from the list of replicas.
Reversed this when failed site recovers
If a failed site was a central server, use election to determine the
new server
E.g. name server, concurrency coordinator, global deadlock
detector
56 Advanced Database system principle
Reconfiguration (Cont.)
Hard to distinguish:
network partition & site failure
After reconfiguration for network partition, avoid:
Two/more central servers elected from different partitions
More than one partition update a replicated data
Updates must continue even if some sites are down
Solution: majority based approach
Alternative “read one write all available”
57 Advanced Database system principle
Majority-Based Approach
Modify existing majority protocol for distributed concurrency
control
Each replica has a version number,
Ver # is updated when the replica is updated
Send lock request to at least ½ the sites
Continue operation if a lock is obtained on a majority of the sites
Read operations: read the value from the replica with largest version
number – the latest
write back to replicas with lower version numbers
58 Advanced Database system principle
Majority-Based Approach (Cont.)
Write operations
find highest version number, set
new version number = old highest version + 1
Write to all locked replicas, update version number
Note: reads are guaranteed to know the latest version number
59 Advanced Database system principle
Read One Write All
Read one write all: allows to read any one replica, but updates
require all replicas are available at commit time
Incorrect for link failure & network partitioning:
If failed link is up now, a disconnected site do not know it was
disconnected
The site has old values, read from that site, incorrect
With network partitioning, sites in each partition may update same
item concurrently
As they believe sites in other partitions have failed
60 Advanced Database system principle
Site Reintegration
After recovery, need to catch all updates missed
Problem: updates may be happened to items whose replica (the
site) is recovering
Solution 1: halt all updates, while reintegrating a site
Cause unacceptable disruption
Solution 2: lock all replicas at the site, update to latest version,
then release locks
61 Advanced Database system principle
Coordinator Selection
Backup coordinators
if coordinator fails, site with enough information, as the new
coordinator,
executes same algorithms as the previous coor.
allows fast recovery from coordinator failure, but generates
overhead in normal situation
Election algorithms
used to elect a new coordinator if failures
Example: Bully Algorithm威逼算法
62 Advanced Database system principle
Bully Algorithm
If site Si sends a request, no response from the coordinator, within T,
then assume it failed
Si elects itself as the new coordinator
Si :
sends an election message,
to every site,
with a higher identification number,
then waits for any process to answer within T.
63 Advanced Database system principle
Bully Algorithm
If no response within T,
assume that all sites (with greater id) failed, Si elects itself the
new coordinator.
If answer is received,
in a new T’,
Si wait for the message: a site with a higher id is elected,
if fail again, restart the algorithm.
64 Advanced Database system principle
Bully Algorithm (Cont.)
After a failed site recovers, it immediately starts this algorithm
If no active sites with higher id, the recovered site forces all
processes to let itself become the coordinator, even if there exists
an active coordinator
Therefore, is called Bully algorithm.
65 Advanced Database system principle
Content
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Query Processing
66 Advanced Database system principle
Distributed Query Processing
For centralized systems, mainly concern the number of disk
accesses as the cost.
In a distributed system, need consider:
data transmission over the network.
Might speedup query processing (distributed)
67 Advanced Database system principle
Query Transformation
Relations might be fragmented, needs to reconstruct.
Consider the horizontal fragmentation of account
account1 = branch_name = “Hillside” (account )
account2 = branch_name = “Valleyview” (account )
The query branch_name = “Hillside” (account ) becomes
branch_name = “Hillside” (account1 account2)
which is optimized into
branch_name = “Hillside” (account1) branch_name = “Hillside” (account2)
Final strategy is to return account1 as the result of the query.
68 Advanced Database system principle
Simple Join Processing
Consider three relations, neither replicated nor fragmented
account depositor branch
account is stored at site S1
depositor at S2
branch at S3
For a query issued at site S1, the system needs to produce the
result at site S1
69 Advanced Database system principle
Possible Query Processing Strategies
S1: Ship copies of all three relations to site S1,
then process at S1.
S2: Ship a copy of account to site S2,
compute temp1 = account depositor at S2
ship temp1 to S3,
compute temp2 = temp1 branch at S3
ship the result temp2 to S1
Must consider following factors:
amount of data being shipped
cost of transmitting data blocks between sites
relative processing speed at each site
70 Advanced Database system principle
Semijoin Strategy
r1 at S1, r2 at S2
Before shipping r2, remove those in r2 do not participate the join
operation. Steps:
1. Compute temp1 R1 R2 (r1) at S1.
2. Ship temp1 from S1 to S2.
3. Compute temp2 r2 temp1 at S2
4. Ship temp2 from S2 to S1.
5. Compute r1 temp2 at S1. This is the same as r1 r2.
71 Advanced Database system principle
Formal Definition
The semijoin of r1 with r2, is denoted by:
r1 r2
it is defined by:
R1 (r1 r2)
Thus, r1 r2 selects those tuples of r1 that contributed to r1 r2.
In step 3, temp2=r2 r1.
For joins of several relations, can extend to a series of semi-join
steps.
72 Advanced Database system principle
End of Chapter