Consistency without
concurrency control
in large,Marcdynamic systems
Shapiro, INRIA & LIP6
Nuno Preguiça, Universidade Nova de Lisboa
Mihai Leția, ENS Lyon
Consistency without
concurrency control
f (x1) g(x1)
x
x1
g(x2)
x2
g(x3) f (x3)
x3
Object x, operation f(x)
propose f(x1)
eventually replayf(x2), f(x3), ...
If f || g commute: converges safely
without concurrency control
Commutative Replicated Data Type
(CRDT): Designed for commutative
Consistency without concurrency control in large, dynamic systems
A sequence CRDT
Treedoc = sequence of elements:
insert-at-pos, delete
Commutative when concurrent
Minimise overhead
Scalable
A commutative replicated data type
for cooperative editing, ICDCS
2009
Focus today:
Garbage collection
vs. scale
Consistency without concurrency control in large, dynamic systems
Commutative updates
R
0 1
I A
0 1 0
L N I
’
=LLI’ N R I A
Naming tree: minimal, self-adjusting:
logarithmic
TID: path = [0|1]*
Contents: infix order
insert adds leaf ⇒ non-destructive, TIDs don’t
change
Delete: tombstone, TIDs don't change
Consistency without concurrency control in large, dynamic systems
Wikipedia GWB page: space
overhead
kB
serialised
Treedoc
wikido
c
×10 revisions
Consistency without concurrency control in large, dynamic systems
Rebalance
R R
I A I A
L N I L N I
’ ’
=L'INR =L'INR
I I
Consistency without concurrency control in large, dynamic systems
Rebalance
N R
’ I I A
!!
L I R • L N I
!
’
=L'INR =L'INR
I I !!!
Invalidates TIDs:
Frame of reference = epoch
Requires agreement
Pervasive!
e.g. Vector Clocks
Consistency without concurrency control in large, dynamic systems
Rebalance in large, dynamic
systems
Rebalance requires consensus
Consensus requires small, stable
membership
Large communities?!
Dynamic scenarios?!
Solution: two tiers
Core: rebalancing (and updates)
Nebula: updates (and rebalancing)
Migration protocol
Consistency without concurrency control in large, dynamic systems
Core Nebula
Group membership
Small, stable
Rebalance: Arbitrary membership
Unanimous agreement Large, dynamic
(2-phase commit) Communicate with sites in
All core sites in same same epoch only
epoch Catch-up to rebalance,
join core epoch
Consistency without concurrency control in large, dynamic systems
Core Nebula
Group membership
Arbitrary membership
Small, stable
Large, dynamic
Rebalance:
Communicate with sites in
Unanimous agreement
same epoch only
(2-phase commit)
Catch-up to rebalance,
All core sites in same join core epoch
epoch
Consistency without concurrency control in large, dynamic systems
Catch-up protocol summary
Core Nebula
Send old updates
replay core's updates
Send rebalance
Replay rebalance,
ignoring nebula updates.
Transform nebula updates.
Send transformed updates.
Replay nebula
updates
Consistency without concurrency control in large, dynamic systems
Catch-up protocol
Core Site Nebula Site ins(L,00
del(1 )
) R R ins(',001
)
I A I A
N I L N I
del(1) ins(L,00)
ins(',001)
Consistency without concurrency control in large, dynamic systems
Catch-up protocol
Core Site Nebula Site ins(L,00
rebalanc )
e R R ins(',001
)
I A I A
N I L N I
del(1) ins(L,00)
rebalance ins(',001)
Consistency without concurrency control in large, dynamic systems
Catch-up protocol
Core Site Nebula Site
I R
N • I A
I R • • L N I
del(1) ins(L,00)
rebalance ins(',001)
Consistency without concurrency control in large, dynamic systems
Catch-up protocol
Core Site Nebula Site
I R
N • I A
I R • • L N I
del(1)
del(1 ins(L,00)
)
rebalanc
rebalance ins(',001)
e
Consistency without concurrency control in large, dynamic systems
Catch-up protocol
Core Site Nebula Site
I I
N • N •
I R • • I R • •
L L
’ ins(L,000
’ ins(L,000)
)
ins(',0001)
Consistency without concurrency control in large, dynamic systems
ins(',0001
Summary
CRDT:
Convergence ensured
Design for commutativity
GC cannot be ignored
Requires commitment
Pervasive issue
Large-scale commitment:
Core / Nebula
To synchronise: catch-up +
migration
Consistency without concurrency control in large, dynamic systems
Future work
More CRDTs
Understanding CRDTs: what invariants can be
CRDTized
Approximations of CRDTs
Data types for consistent cloud computing
without concurrency control
Consistency without concurrency control in large, dynamic systems
Consistency without concurrency control in large, dynamic systems
Replicated sequence
INRI INRI
A A
insert (0, delete
"L") (5)
LINRI INR
A I
insert (1,
"'")
L'INRI insert (0,
A "L")(1,
delete insert
(5) "'")
L'INI L'INR
A I
Consistency without concurrency control in large, dynamic systems
State of the art
Serialising updates
Single, global execution order
Lock-step: Poor user experience
Doesn't scale
Operational Transformation
Local execution orders
Modify arguments to take into account
concurrent operations scheduled before
Weak theory, hidden complexity
Insight: design data type to be commutative
Consistency without concurrency control in large, dynamic systems
Commutative Replicated
Data Type (CRDT)
Assuming:
All concurrent operations commute
Non-concurrent operations execute in happens-
before order
All operations eventually execute at every
replica
Then replicas eventually converge to correct
value
Design data types to support commutative
operations
Consistency without concurrency control in large, dynamic systems
Concurrent inserts
Exceptions to binary tree: disambiguator
Concurrent inserts ordered by disambiguator
Path = site-ID? < [0|1], disambiguator? >*
Alternatives:
site identifier of initiator: short, but delete
leaves a tombstone
or: Unique ID of operation: long, immediate
delete
Consistency without concurrency control in large, dynamic systems
Causal ordering
Vector clocks:
Number of messages received from each site
Causal ordering
Filter duplicate messages
Efficient but grow indefinitely
Treedoc
TID encodes causal order
Duplicates idempotent
Approximate Vector Clocks + Treedoc
Consistency without concurrency control in large, dynamic systems
Rebalance requires
commitment
Commutativity of update || rebalance
Updates win
Rebalance does not impact update performance
Rebalance: unanimous agreement
Standard 2- or 3-phase commit
Initiator is coordinator
Other site: If concurrent update: “Not OK”
Off critical path!
Consistency without concurrency control in large, dynamic systems
Experiments
Estimate overheads, compare design
alternatives:
Atom granularity: word vs. line
Disambiguator: siteID+tombstone vs. unique
ID+immediate delete
Are trees unbalanced?
Effect of rebalance, heuristics
Replay update history of CVS, SVN, Wiki
repositories
Consistency without concurrency control in large, dynamic systems
Implementation alternatives
Disambiguator: next slide
Atom: character, word, line, paragraph
Fine-grain: structural overhead
Coarse-grain: “delete” artefacts
Data structure:
Tree: no TID storage, no GC interior nodes
vs. { (TID, atom) }: flexible, GC
Mixed
Arity: binary vs. 256-ary
Consistency without concurrency control in large, dynamic systems
Disambiguator design
Disambiguator options:
1 byte, no GC until rebalance
or 10 bytes, immediate GC (if leaf)
IStored in n°
thought every
1... node
but I was wrong
Intuitively which do you think is best?
Consistency without concurrency control in large, dynamic systems
Latex files
Consistency without concurrency control in large, dynamic systems
Latex / line
Consistency without concurrency control in large, dynamic systems
Summarize: % deleted
nodes (mean)
Consistency without concurrency control in large, dynamic systems
Atom granularity and deletes
% deleted longest TID avg TID
character 73.61 53 27.907
word 69.27 16 6.877
paragraph 81.90 11 4.125
(200 revisions
only)
Consistency without concurrency control in large, dynamic systems
Wikipedia GWB benchmark
[Link]/George_W_Bush
150 kB text
42,000 revisions: most frequently revised
Biggest revision: 100 kB
Benchmark data
Treedoc node = paragraph
First 15,000 revisions = 350,000 updates
Biggest revision < 2 s; average: 0.24 s/revision
Rebalance every 1,000 revisions
256-ary tree
Consistency without concurrency control in large, dynamic systems
Wikipedia GWB page
nodes
tombstones
liv
e ×1000
ops
Consistency without concurrency control in large, dynamic systems
Time per operation
μs
no rebalance
with rebalance
×1000 ops
Consistency without concurrency control in large, dynamic systems
Size, frequency of TIDs
number number
bytes [0,255]
relative
Consistency without concurrency control in large, dynamic systems
flat vs. WOOT vs. Treedoc
Consistency without concurrency control in large, dynamic systems
Summary: garbage collection
Efficiency, GC are important
Tree + re-balance
Requires commitment (move off
critical path)
Pervasive issue
Large-scale commitment:
Core: commitment
Nebula: asynchronous updates
Occasional synchronisation:
migration
Consistency without concurrency control in large, dynamic systems
Summary: CRDT
CRDT:
Convergence ensured
Design for commutativity
Techniques for commutativity:
Partial order
Non-destructive updates
Identifiers don't depend on
concurrent activity
Consensus: off critical path
Consistency without concurrency control in large, dynamic systems
Commutativity:
Genuine vs.u, vprecedence
Commutativity: ∀S, : S.u.v ≡ S.v.u
Genuine: both u and v take effect
“Addition, subtraction commutative”
Non-destructive updates
Precedence: only one of u, v ultimately takes
effect
“File writes commute under Last-Writer-Wins”
Destructive updates
Consistency without concurrency control in large, dynamic systems
Future work
Integrate with real editor
(Single-document) Transactions
Generalisation
Characterise invariants
More data types: set, multilog, others?
Mixed data types: improve
cf. persistent data structures
When consensus required:
Move off critical path
Speculation + conflict resolution
Consistency without concurrency control in large, dynamic systems