MUREX:
A Mutable Replica Control
Scheme for Structured
Peer-to-Peer Storage Systems
Presented by
Jehn-Ruey Jiang
National Central University
Taiwan, R. O. C.
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
2/40
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
3/40
P2P Storage Systems
To aggregate idle storage across the
Internet into one huge storage space
Towards Global Storage Systems
Massive Nodes
Massive Capacity
4/40
Unstructured vs. Structured
Unstructured:
No restriction on the interconnection of nodes
Easy to build but not scalable
Structured: Our Focus!!
Based on DHTs (Distributed Hash Tables)
More scalable
5/40
Non-Mutable vs. Mutable
Non-Mutable (Read-only):
CFS
PAST
Charles
Mutable: Our Focus!!
Ivy
Eliot
Oasis
Om
6/40
Replication
Data objects are replicated for the purpose
of fault tolerance and high data availability
Some DHTs have provided replication
utilities, which are usually used to replicate
routing states
The proposed protocol replicates data
objects in the application layer so that it
can be built on top of any DHT
7/40
One-Copy Equivalence
Data consistency criterion
The set of replicas must behave as if
there were only a single copy
Conditions:
1. no pair of write operations can proceed at the
same time,
2. no pair of a read operation and a write
operation can proceed at the same time,
3. a read operation always returns the value
written by the last write operation.
8/40
Synchronous vs. Asynchronous
Synchronous Replication: Our Focus!!
Each write operation should finish updating all
replicas before the next write operation proceeds.
Strict data consistency
Long operation latency
Asynchronous Replication
A write operation updates the local replica first; the data
object is then asynchronously written to the other replicas.
May violate data consistency
Shorter latency
Log-based mechanisms to roll back the system
9/40
Fault Models
Fail-Stop
Nodes just stop functioning when they fail
Crash-Recovery
Failures are detectable
Nodes can recover and rejoin the system after
state synchronization
Byzantine
Nodes may behave arbitrarily
10/40
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
11/40
Three Problems
Replica migration
Replica acquisition
State synchronization
12/40
DHT – Node Joining
[Figure: data object s is hashed to key ks in the hashed key space 0 .. 2^128-1
and stored at the responsible peer node u; when a new node v joins and becomes
responsible for ks, the replica must be handed over to v: Replica Migration]
13/40
DHT – Node Leaving
[Figure: data object r is hashed to key kr in the hashed key space 0 .. 2^128-1
and stored at peer node p; when p leaves, the newly responsible node q must
obtain a copy of r: Replica Acquisition. When a failed node recovers and
rejoins, its replica state must be brought up to date: State Synchronization]
14/40
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
15/40
The Solution - MUREX
A mutable replica control scheme
Keeping one-copy equivalence for
synchronous P2P storage replication under
the crash-recovery fault model
Based on Multi-column read/write quorums
16/40
Operations
Publish(CON, DON)
CON: stands for CONtent
DON: stands for Data Object Name
Read(DON)
Write(CON, DON)
17/40
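As an illustration of the three operations, here is a minimal sketch of how they might look as a client-side interface. The class name MurexClient and the parameter types are assumptions for readability; the slides only name the operations.

```python
# Illustrative signatures only; MurexClient and the parameter types are assumed.
class MurexClient:
    def publish(self, con: bytes, don: str) -> None:
        """Create the data object named DON with content CON (disseminates its replicas)."""
        ...

    def read(self, don: str) -> bytes:
        """Return the current content of the data object named DON."""
        ...

    def write(self, con: bytes, don: str) -> None:
        """Overwrite the data object named DON with content CON."""
        ...
```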
Synchronous Replication
n replicas for each data object
K1=HASH1(Data Object Name), …,
Kn=HASHn(Data Object Name)
Using read/write quorums to maintain data
consistency (one-copy equivalence)
18/40
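The slides do not say how the n hash functions HASH1 .. HASHn are realized. A common approach, assumed here purely for illustration, is to salt a single base hash with the replica index:

```python
import hashlib

KEY_SPACE_BITS = 128  # hashed key space 0 .. 2^128 - 1, as in the figures

def replica_keys(data_object_name: str, n: int) -> list[int]:
    """Derive K1..Kn by salting one base hash with the replica index
    (an assumed realization; the slides only require n distinct hashes)."""
    keys = []
    for i in range(1, n + 1):
        digest = hashlib.sha256(f"{i}:{data_object_name}".encode()).digest()
        keys.append(int.from_bytes(digest, "big") % (1 << KEY_SPACE_BITS))
    return keys

# Example: five replica keys for one data object name
print(replica_keys("reports/murex-example", 5))
```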
Data Replication
[Figure: a data object is hashed by hash functions 1 .. n to keys k1, ..., kn
in the hashed key space 0 .. 2^128-1; replica 1, ..., replica n are stored at
the peer nodes responsible for those keys]
19/40
Quorum-Based Schemes (1/2)
High data availability and low communication
cost
n replicas with version numbers
Read operation (see the sketch after this slide)
Read-lock and access a read quorum
Obtain a replica with the largest version number
Write operation
Write-lock and access a write quorum
Update all replicas in the quorum with the new
version number (the largest version number + 1)
20/40
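A condensed sketch of the quorum logic just described. Locking, networking, and failure handling are elided, and the Replica type is an illustrative assumption; only the version-number rule is taken from the slide.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    version: int = 0
    value: bytes = b""

def quorum_read(read_quorum: list[Replica]) -> bytes:
    """Read-lock a read quorum (locking elided) and return the value of a
    replica carrying the largest version number."""
    newest = max(read_quorum, key=lambda r: r.version)
    return newest.value

def quorum_write(write_quorum: list[Replica], value: bytes) -> None:
    """Write-lock a write quorum (locking elided), then install the new value
    with version = largest observed version number + 1."""
    new_version = max(r.version for r in write_quorum) + 1
    for r in write_quorum:
        r.version = new_version
        r.value = value
```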
Quorum-Based Schemes (2/2)
One-copy equivalence is guaranteed if we enforce
Write-write and write-read lock exclusion
Intersection Property
A non-empty intersection for any pair of
a read quorum and a write quorum
two write quorums
21/40
Multi-Column Quorums
Smallest quorums: constant-sized
quorums in the best case
Smaller quorums imply lower communication
cost
May achieve the highest data availability
22/40
Messages
LOCK (WLOCK/RLOCK)
OK
WAIT
MISS
UNLOCK
23/40
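For reference, the message types can be written down as a small enumeration. The one-line glosses are assumptions based on how the messages are used in the following slides, not definitions given here.

```python
from enum import Enum, auto

class Message(Enum):
    """Messages exchanged while acquiring and releasing quorum locks."""
    WLOCK = auto()   # request a write lock on a replica
    RLOCK = auto()   # request a read lock on a replica
    OK = auto()      # lock granted
    WAIT = auto()    # lock currently held elsewhere; the requester must wait
    MISS = auto()    # the contacted node has no usable replica (see slide 27)
    UNLOCK = auto()  # release a previously granted lock
```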
Algorithms for Quorum Construction
24/40
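The original pseudo-code for this slide is not reproduced here. As an illustration only, the sketch below builds quorums under one common reading of the multi-column structure MC(m, s): a write quorum takes every member of some column plus one member from each later column, and a read quorum takes one member from every column. These definitions are assumptions; they satisfy the intersection properties on the previous slide and yield a constant-size best case (the last column alone), but may differ in detail from MUREX's actual construction algorithms.

```python
import random

def make_columns(m: int, s: int) -> list[list[int]]:
    """MC(m, s): m pairwise-disjoint columns of s replica ids each."""
    return [list(range(i * s, (i + 1) * s)) for i in range(m)]

def write_quorum(columns: list[list[int]], i: int) -> set[int]:
    """All members of column i plus one member of every later column.
    With i = m-1 this is just the last column: the constant-size best case."""
    q = set(columns[i])
    for later in columns[i + 1:]:
        q.add(random.choice(later))
    return q

def read_quorum(columns: list[list[int]]) -> set[int]:
    """One member from every column; it meets every write quorum because a
    write quorum contains some column in full."""
    return {random.choice(col) for col in columns}

cols = make_columns(4, 2)
print(write_quorum(cols, 3))   # best case: the last column only
print(write_quorum(cols, 0))   # first column plus one member per later column
print(read_quorum(cols))
```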
Three Mechanisms
Replica pointer
On-demand replica regeneration
Leased lock
25/40
Replica pointer
A lightweight mechanism for migrating
replicas
A five-tuple:
(hashed key, data object name, version
number, lock state, actual storing location)
It is created when a replica is first
generated.
It is moved between nodes instead of the
actual data object (see the sketch below).
26/40
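The five-tuple maps naturally onto a small record. The field names below paraphrase the slide; the concrete LockState values are an assumption, since the slide only says a lock state is recorded.

```python
from dataclasses import dataclass
from enum import Enum, auto

class LockState(Enum):     # possible values assumed; the slide only names "lock state"
    UNLOCKED = auto()
    READ_LOCKED = auto()
    WRITE_LOCKED = auto()

@dataclass
class ReplicaPointer:
    """The five-tuple kept by the node responsible for a replica's hashed key."""
    hashed_key: int            # Ki = HASHi(data object name)
    data_object_name: str
    version_number: int
    lock_state: LockState
    storing_location: str      # the node that actually stores the replica data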
On-demand replica regeneration (1/2)
When node p receives LOCK from node u,
it sends a MISS if it
does not have the replica pointer
has the replica pointer, but the node v it names
as storing the replica is not alive
After executing the desired read/write
operation, node u will send the newest
replica obtained/generated to node p
27/40
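A sketch of the decision just described, assuming the ReplicaPointer record from the previous slide and a caller-supplied is_alive() check; the real MUREX message handling is more involved.

```python
def handle_lock(pointer, is_alive) -> str:
    """Reply a responsible node p sends for a LOCK request.

    p answers MISS when it has no replica pointer, or when the pointer names a
    storing node that is no longer alive; the requester then pushes the newest
    replica back to p after finishing its read/write (on-demand regeneration).
    """
    if pointer is None:
        return "MISS"
    if not is_alive(pointer.storing_location):
        return "MISS"
    return "OK"  # or WAIT when the replica is currently locked (elided here)

# Example: no pointer at all, so the requester must later regenerate the replica
print(handle_lock(None, is_alive=lambda node: True))  # -> "MISS"
```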
On-demand replica regeneration (2/2)
Acquiring replicas only when they are
requested
Dummy read operation
Performed periodically for rarely-accessed
data objects
To check whether the replicas of a data object
are still alive
To re-disseminate replicas to the proper nodes to
maintain data persistence
28/40
Leased lock (1/2)
A lock expires after a lease period of L
A node should release all its locks if it is not yet in
the critical section (CS) and H > L - C - D holds
H: the time the lock has been held so far
C: the time needed to finish the operation in the CS
D: the message propagation delay
29/40
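A direct transcription of the release rule, assuming all times are expressed in the same unit; the random backoff that follows a release appears on the next slide.

```python
def must_release(holding_time: float, lease: float,
                 cs_time: float, delay: float, in_cs: bool) -> bool:
    """Release all locks if the node is not yet in the critical section and
    there is no longer enough lease left (H > L - C - D) to finish safely."""
    return (not in_cs) and holding_time > lease - cs_time - delay

# Example: L = 10, C = 4, D = 2, so locks must be released once H exceeds 4
print(must_release(holding_time=3, lease=10, cs_time=4, delay=2, in_cs=False))  # False
print(must_release(holding_time=5, lease=10, cs_time=4, delay=2, in_cs=False))  # True
```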
Leased lock (2/2)
After releasing all its locks, a node
backs off for a random time and then
starts over to request the locks
If a node starts to substitute for another
node at time T, the newly acquired
replica can start to reply to LOCK
requests at time T+L
30/40
Correctness
Theorem 1. (Safety Property)
MUREX ensures the one-copy equivalence
consistency criterion
Theorem 2. (Liveness Property)
There is neither deadlock nor starvation in
MUREX
31/40
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
32/40
Communication Cost
If there is no contention
In the best case: 3s messages
One LOCK, one OK, and one UNLOCK per
replica, where s is the size of the last
column of the multi-column quorum
When failures occur
Communication cost increases gradually
In the worst case: O(n) messages
A node sends LOCK messages to all n replicas
(plus the related OK, WAIT, and UNLOCK messages)
33/40
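A tiny numeric illustration of the message counts above, using the MC(m, s) structures from the simulation section. Treating n as m*s replicas is an assumption about how the columns are populated; the worst case is only stated as O(n) in the slide.

```python
def best_case_messages(s: int) -> int:
    # one LOCK, one OK, and one UNLOCK per replica of the last column
    return 3 * s

for m, s in [(5, 3), (4, 3), (5, 2), (4, 2)]:
    n = m * s  # assuming every column holds s replicas
    print(f"MC({m}, {s}): {best_case_messages(s)} messages in the best case, "
          f"O(n) with n = {n} when failures force contacting every replica")
```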
Simulation
Environment
The underlying DHT is Tornado
For quorums under four multi-column structures
MC(5, 3), MC(4, 3), MC(5, 2) and MC(4, 2)
For MC(m, s), the lease period is assumed to
be m * (turn-around time)
2000 nodes in the system
Simulation for 3000 seconds
10000 operations are requested
Half for reading and half for writing
Each request is assumed to be destined for a
random file (data object)
34/40
Simulation Result 1
The probability that a node succeeds in performing the
desired operation before the leased lock expires
1st experiment: no node joins or leaves
[Figure: success probability vs. degree of contention]
35/40
Simulation Result 2
2nd experiment: 200 out of 2000 nodes
may join/leave at will
36/40
Simulation Result 3
3rd experiment: 0, 50, 100 or 200 out of
2000 nodes may leave
37/40
Outline
P2P Storage Systems
The Problems
MUREX
Analysis and Simulation
Conclusion
38/40
Conclusion
We identify three problems for synchronous
replication in mutable P2P storage systems
Replica migration
Replica acquisition
State synchronization
We propose MUREX to solve these problems by
Multi-column read/write quorums
Replica pointer
On-demand replica regeneration
Leased lock
39/40
Thanks!!
40/40