Motivation
• Lots of data records, sharded on multiple servers, lots of clients
• Client application actions often involve multiple reads and writes
o Bank transfer: debit and credit
• We'd like to hide interleaving and failure from application writers
• This is a traditional database concern
o But the ideas are used in many distributed systems
Example
• x and y are bank balances
o Records in database tables
• x and y are on different servers (maybe at different banks)
• x and y start out as $10
• T1 and T2 are transactions
T1: transfer $1 from y to x
T2: audit, to check that no money is lost
T1: T2:
begin_transaction begin_transaction
add(x, 1) tmp1 = get(x)
add(y, -1) tmp2 = get(y)
end_transaction print tmp1, tmp2
end_transaction
Correct behavior for a transaction
• Usually called "ACID"
o Atomic
§ All writes or none, despite failures (abortability)
o Consistent
§ Obeys application-specific invariants
o Isolated
§ No interference between transactions (serializable)
o Durable
§ Committed writes are permanent
• We're interested in ACID for distributed transactions
o With data sharded over multiple servers
Serializability
• Execution of some concurrent transactions yield results
o "results" means both changes (T1) and output (T2) in the DB
• The results are serializable if:
o There exists a serial execution order of the transactions
o Yields the same results as the actual execution
§ Serial means one at a time
• No parallel execution
§ This definition should remind you of linearizability
• You can test whether an execution's result is serializable by looking
for an order that yields the same results
o For our example, the possible serial orders are
T1; T2 or T2; T1
• So, the correct (serializable) results are:
T1; T2: x=11 y=9 "11,9"
T2; T1: x=11 y=9 "10,10"
• The results for the two differ; either is okay
o No other result is okay
• The implementation might have executed T1 and T2 in parallel
o But it must still yield results as if in a serial order
What if T1's operations run entirely between T2's two get()s? would the
result be serializable?
• T2 would print 10,9
• But 10,9 is not one of the two serializable results!
What if T2 runs entirely between T1's two adds()s?
• T2 would print 11,10
• But 11,10 is not one of the two serializable results!
Why serializability is popular?
• An easy model for programmers
o They can write complex transactions while ignoring concurrency
• It allows parallel execution of transactions on different records
A transaction can "abort" if something goes wrong
• An abort un-does any record modifications
1. The transaction might voluntarily abort,
§ E.g., if the account doesn't exist, or y's balance is <= 0
2. The system may force an abort, e.g., to break a deadlock
3. Some server’s failures result in abort
• The application might (or might not) try the transaction again
Components of distributed transactions
1. Concurrency control
o To provide isolation/serializability
2. Atomic commit
o To provide atomicity despite failure
Concurrency control
• Correct execution of concurrent transactions
Classes of concurrency control
1. Pessimistic concurrency control
o Lock records before use
o Conflicts cause delays (waiting for locks)
o Faster if conflicts are frequent
§ E.g., 2PL
2. Optimistic concurrency control
o Use records without locking
o Check before commit if reads/writes were serializable
o Conflict causes abort+retry
o Faster if conflicts are rare
§ E.g., FaRM
Pessimistic Concurrency Control
Two-phase locking (2PL)
• Used to implement serializability
Definition:
• A transaction must acquire a record's lock before using it
• A transaction must hold its locks until *after* commit or abort
2PL for our example
• Suppose T1 and T2 start at the same time
• The transaction system automatically acquires locks as needed
• So first of T1/T2 to use x will get the lock
o Get another lock for y as well
• The other waits until the first completely finishes
• This prohibits the non-serializable interleaving
Why hold locks until after commit/abort? why not release as soon as done with
the record?
• Example 1 of a resulting problem:
o Suppose T2 releases x's lock after get(x)
o T1 could then execute between T2's get()s
o T2 would print 10,9
o Oops: that is not a serializable execution: neither T1;T2 nor
T2;T1
• Example 2 of a resulting problem:
o Suppose T1 writes x, then releases x's lock
o T2 reads x and prints
o T1 then aborts
§ Maybe because the value of y is already 0
o Oops: T2 used a value that never really existed
o We should have aborted T2, which would be a "cascading abort"
• Deadlock!
T1 T2
get(x) get(y)
get(y) get(x)
• The system must detect cycles and abort a transaction (e.g., by using
lock timeout)
Optimistic Concurrency Control
• Increases concurrency more than pessimistic concurrency control
o i.e., Increases transactions per second by lowering latency
• Used in Dropbox, Wikipedia, key-value stores like Cassandra, and
Amazon’s Dynamo
• Preferable than pessimistic when conflicts are expected to be rare
o But still need to ensure conflicts are caught
First-cut approach:
• Write and read objects at will
• Check for serial equivalence at commit time
• If abort, roll back updates made
• An abort may result in other transactions that read dirty data, also
being aborted
• Any transactions that read from those transactions also now need to be
aborted
o Cascading aborts
Atomic Commit
How can distributed transactions cope with failures?
• Suppose for our example, x and y are on different "worker" servers
• Suppose x's server adds 1, but y's crashes before subtracting
o Or x's server adds 1, but y realizes the account doesn't exist or
y’s balance is 0
• Or x and y both can do their part, but aren't sure if the other will
We want "atomic commit":
• A bunch of computers are cooperating on some task
• Each computer has a different role
• Want to ensure atomicity: all execute, or none execute
Challenges
• Failures, performance
We're going to use a protocol called "two-phase commit"
• Used by distributed databases for multi-server transactions
The setting
• Data is sharded among multiple servers
• Transactions run on "transaction coordinators" (TCs)
• For each read/write, TC sends RPC to relevant shard server
o Each is a "participant"
o Each participant manages locks for its shard of the data
• There may be many concurrent transactions
o TC assigns unique transaction ID (TID) to each transaction
§ Allows TC to resend commit messages
o Every message, every table entry tagged with TID
Two-phase commit (2PC)
• TC sends get(), put() RPCs to A, B
o A and B lock records
o Modifications are tentative, on a copy
• TC gets to the end of the transaction
• TC sends PREPARE messages to A and B
• If A is willing to commit
o A respond YES
o Then A is in "prepared" state
• Otherwise, A responds NO
• Same for B
• If both A and B say YES, TC sends COMMIT messages to A and B
o If either A or B says NO, TC sends ABORT messages
• A/B commit if they get a COMMIT message from the TC
o I.e., they write tentative records to the real DB
o And release the transaction's locks on their records
• A and B acknowledge COMMIT message
Correctness
• Neither A nor B can commit unless they both agreed
Fault Tolerance
1. Node Failure
What if B crashes and restarts?
• If B sent YES before crash, B must remember (despite crash)!
• Because TC might have sent commit message to A and A might have already
committed
• So, B must be able to commit (or not) even after a reboot
Thus, participants must write persistent (on-disk) state:
• B must remember on disk before saying YES, including modified data
• If B reboots, and disk says YES but no COMMIT
o B must ask TC or wait for TC to re-send
• And meanwhile, B must continue to hold the transaction's locks
• If TC says COMMIT, B copies modified data to real data
What if TC crashes and restarts?
• If TC might have sent COMMIT before crash, TC must remember!
o Since one worker may already have committed
• Thus, TC must write COMMIT to disk before sending COMMIT messages
• And repeat COMMIT if it crashes and reboots,
o Or if a participant asks (i.e., if A/B didn't get COMMIT msg)
• Participants must filter out duplicate COMMITs (using TID)
2. Network Failure
What if TC never gets a YES/NO from B?
• Perhaps B crashed and didn't recover; perhaps network is broken
• TC can time out, and abort (since has not sent any COMMIT messages)
• Good: allows servers to release locks
What if B never gets a PREPARE from TC?
• B has not yet responded to PREPARE, so TC can't have decided commit
• So, B can unilaterally abort, and release locks
o Shouldn’t hold lock for a long time
• Respond NO to future PREPARE
What if B replied YES to PREPARE, but doesn't receive COMMIT or ABORT?
• Can B unilaterally decide to abort?
o No! TC might have gotten YES from both, and sent out COMMIT to A,
but crashed before sending to B
o So, then A would commit, and B would abort: incorrect
• B can't unilaterally commit, either:
o A might have voted NO
So: if B voted YES, it must "block": wait for TC decision
Note:
• The commit/abort decision is made by a single entity => the TC
• This makes two-phase commit relatively straightforward
• The penalty is that A/B, after voting YES, must wait for the TC
When can TC completely forget about a committed transaction?
• If it sees an acknowledgement from every participant for the COMMIT
• Then no participant will ever need to ask again
When can participant completely forgets about a committed transaction?
• After it acknowledges the TC's COMMIT message
• If it gets another COMMIT, and has no record of the transaction,
o It must have already committed and forgotten and can acknowledge
(again)
Two-phase commit perspective
• Used in sharded DBs when a transaction uses data on multiple shards
• But it has a bad reputation:
o Slow: multiple rounds of messages
o Slow: disk writes
o Locks are held over the prepare/commit exchanges
§ Blocks other transactions
o TC crash can cause indefinite blocking, with locks held
• Thus, usually used only in a single small domain
o E.g., not between banks, not between airlines, not over wide area
networks
• Faster distributed transactions are active research area
Raft and two-phase commit solve different problems!
• Use Raft to get high availability by replicating
o I.e., to be able to operate when some servers are crashed
o The servers all do the *same* thing
• Use 2PC when each participant does something different
o And *all* of them must do their part
• 2PC does not help availability
o Since all servers must be up to get anything done
• Raft does not ensure that all servers do something
o Since only a majority have to be alive
What if you want high availability *and* atomic commit?
• Here's one plan
• The TC and servers should each be replicated with Raft
• Run two-phase commit among the replicated services
• Then you can tolerate failures and still make progress
• Google Spanner uses this arrangement