Dynamodb Part 2

The document discusses the architecture and techniques used in Dynamo, a decentralized storage system designed for high availability and scalability. It highlights key features such as consistent hashing for partitioning, vector clocks for versioning, and the use of a gossip-based protocol for membership and failure detection. Dynamo's design allows for eventual consistency, enabling applications to handle concurrent writes and temporary failures without rejecting updates.


Figure 2: Partitioning and replication of keys in Dynamo ring. Nodes B, C and D store keys in range (A, B) including K.

Traditional replicated relational database systems focus on the problem of guaranteeing strong consistency to replicated data. Although strong consistency provides the application writer a convenient programming model, these systems are limited in scalability and availability [7]. These systems are not capable of handling network partitions because they typically provide strong consistency guarantees.

3.3 Discussion
Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or concurrent writes. This is a crucial requirement for many Amazon applications. Second, as noted earlier, Dynamo is built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Third, applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported by traditional databases). Fourth, Dynamo is built for latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes (which is the typical design adopted by several distributed hash table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby increasing the latency at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.

4. SYSTEM ARCHITECTURE
The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. Describing the details of each of the solutions is not possible, so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling. Table 1 presents a summary of the techniques Dynamo uses and their respective advantages.

Table 1: Summary of techniques used in Dynamo and their advantages.

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability.
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates.
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background.
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.

4.1 System Interface
Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context. The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.

Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
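To make the shape of this interface concrete, the sketch below shows a client-facing wrapper in Python. The class name, method signatures, and return shapes are illustrative assumptions rather than Dynamo's actual API; only the key-to-identifier mapping follows the MD5 scheme described above.

    import hashlib

    def key_to_identifier(key: bytes) -> int:
        # Keys are opaque bytes; MD5 yields the 128-bit identifier used to
        # pick the storage nodes responsible for the key.
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    class DynamoInterfaceSketch:
        """Shape of the two-operation interface; method bodies are stubs."""

        def get(self, key: bytes):
            # Returns (versions, context): a single object, or a list of
            # causally-unrelated versions, plus an opaque context holding
            # version metadata such as the vector clock.
            raise NotImplementedError

        def put(self, key: bytes, context, obj: bytes) -> None:
            # Placement is derived from the key; the context obtained from an
            # earlier get() tells the system which version(s) this write updates.
            raise NotImplementedError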
4.2 Partitioning Algorithm
One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or "ring" (i.e., the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its "position" on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.

The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. To address these issues, Dynamo uses a variant of consistent hashing (similar to the one used in [10, 20]): instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of "virtual nodes". A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, "tokens") in the ring. The process of fine-tuning Dynamo's partitioning scheme is discussed in Section 6.

Using virtual nodes has the following advantages:

• If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes.

• When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
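The ring walk and virtual-node assignment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions (MD5 for positions, tokens derived from a node name and an index, an arbitrary token count), not Dynamo's production partitioning code.

    import bisect
    import hashlib

    def ring_position(data: bytes) -> int:
        # Positions live in a fixed circular 128-bit space (MD5 is assumed here).
        return int.from_bytes(hashlib.md5(data).digest(), "big")

    class ConsistentHashRing:
        def __init__(self, nodes, tokens_per_node=8):   # token count is illustrative
            pairs = []
            for node in nodes:
                for t in range(tokens_per_node):
                    # Each physical node owns several virtual-node positions ("tokens").
                    pairs.append((ring_position(f"{node}:{t}".encode()), node))
            pairs.sort()
            self.ring = pairs
            self.positions = [p for p, _ in pairs]

        def coordinator(self, key: bytes) -> str:
            # Hash the key onto the ring, then walk clockwise to the first node
            # whose position is larger than the key's position (wrapping around).
            idx = bisect.bisect_right(self.positions, ring_position(key))
            return self.ring[idx % len(self.ring)][1]

    ring = ConsistentHashRing(["A", "B", "C", "D"])
    print(ring.coordinator(b"some-key"))   # one of "A".."D"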
4.3 Replication
To achieve high availability and durability, Dynamo replicates its data on multiple hosts. Each data item is replicated at N hosts, where N is a parameter configured "per-instance". Each key, k, is assigned to a coordinator node (described in the previous section). The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor. In Figure 2, node B replicates the key k at nodes C and D in addition to storing it locally. Node D will store the keys that fall in the ranges (A, B], (B, C], and (C, D].

The list of nodes that is responsible for storing a particular key is called the preference list. The system is designed, as will be explained in Section 4.8, so that every node in the system can determine which nodes should be in this list for any particular key. To account for node failures, the preference list contains more than N nodes. Note that with the use of virtual nodes, it is possible that the first N successor positions for a particular key may be owned by fewer than N distinct physical nodes (i.e., a node may hold more than one of the first N positions). To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
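Continuing the ring sketch above, a preference list can be derived by walking successor positions clockwise and skipping tokens that belong to a physical node already chosen. The helper below is an illustrative sketch that reuses the hypothetical ConsistentHashRing from the previous example.

    def preference_list(ring: ConsistentHashRing, key: bytes, n: int):
        # Walk clockwise from the key's position, skipping tokens whose physical
        # node has already been chosen, until n distinct physical nodes are found.
        idx = bisect.bisect_right(ring.positions, ring_position(key))
        chosen = []
        for step in range(len(ring.ring)):
            node = ring.ring[(idx + step) % len(ring.ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen        # chosen[0] acts as the key's coordinator

    print(preference_list(ring, b"some-key", 3))   # e.g. ['B', 'C', 'D']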
4.4 Data Versioning
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. If there are no failures then there is a bound on the update propagation times. However, under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.

There is a category of applications in Amazon's platform that can tolerate such inconsistencies and can be constructed to operate under these conditions. For example, the shopping cart application requires that an "Add to Cart" operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn't supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. Note that both "add to cart" and "delete item from cart" operations are translated into put requests to Dynamo. When a customer wants to add an item to (or remove from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later.

In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time. Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). However, version branching may happen in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation). A typical example of a collapse operation is "merging" different versions of a customer's shopping cart. Using this reconciliation mechanism, an "add to cart" operation is never lost. However, deleted items can resurface.

It is important to understand that certain failure modes can potentially result in the system having not just two but several versions of the same data. Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories, which the system will need to reconcile in the future. This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).

Dynamo uses vector clocks [12] in order to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object. One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks. If the counters on the first object's clock are less than or equal to all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.
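The ancestor test described here can be written down directly. The sketch below assumes a vector clock is represented as a dict mapping a node identifier to its counter (a representation chosen for illustration) and uses values matching the Figure 3 walk-through discussed below.

    def descends(clock_a: dict, clock_b: dict) -> bool:
        # clock_a is an ancestor of clock_b if every counter in clock_a is
        # less than or equal to the corresponding counter in clock_b
        # (a node missing from clock_b counts as zero).
        return all(c <= clock_b.get(node, 0) for node, c in clock_a.items())

    def in_conflict(clock_a: dict, clock_b: dict) -> bool:
        # Neither version descends from the other: parallel branches that the
        # client must reconcile semantically.
        return not descends(clock_a, clock_b) and not descends(clock_b, clock_a)

    d2 = {"Sx": 2}                 # values match the Figure 3 example below
    d3 = {"Sx": 2, "Sy": 1}
    d4 = {"Sx": 2, "Sz": 1}
    print(descends(d2, d3))        # True: D2 can be forgotten once D3 is seen
    print(in_conflict(d3, d4))     # True: both D3 and D4 must be kept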
In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is done by passing the context it obtained from an earlier read operation, which contains the vector clock information. Upon processing a read request, if Dynamo has access to multiple branches that cannot be syntactically reconciled, it will return all the objects at the leaves, with the corresponding version information in the context. An update using this context is considered to have reconciled the divergent versions and the branches are collapsed into a single new version.

Figure 3: Version evolution of an object over time.

To illustrate the use of vector clocks, let us consider the example shown in Figure 3. A client writes a new object. The node (say Sx) that handles the write for this key increases its sequence number and uses it to create the data's vector clock. The system now has the object D1 and its associated clock [(Sx, 1)]. The client updates the object. Assume the same node handles this request as well. The system now also has object D2 and its associated clock [(Sx, 2)]. D2 descends from D1 and therefore over-writes D1; however, there may be replicas of D1 lingering at nodes that have not yet seen D2. Let us assume that the same client updates the object again and a different server (say Sy) handles the request. The system now has data D3 and its associated clock [(Sx, 2), (Sy, 1)].

Next assume a different client reads D2 and then tries to update it, and another node (say Sz) does the write. The system now has D4 (a descendant of D2) whose version clock is [(Sx, 2), (Sz, 1)]. A node that is aware of D1 or D2 could determine, upon receiving D4 and its clock, that D1 and D2 are overwritten by the new data and can be garbage collected. A node that is aware of D3 and receives D4 will find that there is no causal relation between them. In other words, there are changes in D3 and D4 that are not reflected in each other. Both versions of the data must be kept and presented to a client (upon a read) for semantic reconciliation.

Now assume some client reads both D3 and D4 (the context will reflect that both values were found by the read). The read's context is a summary of the clocks of D3 and D4, namely [(Sx, 2), (Sy, 1), (Sz, 1)]. If the client performs the reconciliation and node Sx coordinates the write, Sx will update its sequence number in the clock. The new data D5 will have the following clock: [(Sx, 3), (Sy, 1), (Sz, 1)].

A possible issue with vector clocks is that the size of vector clocks may grow if many servers coordinate the writes to an object. In practice, this is not likely because the writes are usually handled by one of the top N nodes in the preference list. In case of network partitions or multiple server failures, write requests may be handled by nodes that are not in the top N nodes in the preference list, causing the size of the vector clock to grow. In these scenarios, it is desirable to limit the size of the vector clock. To this end, Dynamo employs the following clock truncation scheme: along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies in reconciliation, as the descendant relationships cannot be derived accurately. However, this problem has not surfaced in production and therefore this issue has not been thoroughly investigated.
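A minimal sketch of this truncation rule, assuming each clock entry is stored as a (node, counter, last-update timestamp) tuple; the threshold of 10 is the example value from the text, and the data layout is an assumption.

    CLOCK_SIZE_THRESHOLD = 10   # "say 10", as in the text above

    def truncate(clock_entries):
        # clock_entries: list of (node, counter, last_update_timestamp) tuples.
        # Once the clock exceeds the threshold, drop the least recently updated pair.
        entries = sorted(clock_entries, key=lambda e: e[2])
        while len(entries) > CLOCK_SIZE_THRESHOLD:
            entries.pop(0)      # oldest timestamp first
        return entries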
4.5 Execution of get() and put() operations
Any storage node in Dynamo is eligible to receive client get and put operations for any key. In this section, for the sake of simplicity, we describe how these operations are performed in a failure-free environment; in the subsequent section we describe how read and write operations are executed during failures.

Both get and put operations are invoked using Amazon's infrastructure-specific request processing framework over HTTP. There are two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach is that the client does not have to link any code specific to Dynamo in its application, whereas the second strategy can achieve lower latency because it skips a potential forwarding step.

A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list. If the requests are received through a load balancer, requests to access a key may be routed to any random node in the ring. In this scenario, the node that receives the request will not coordinate it if the node is not in the top N of the requested key's preference list. Instead, that node will forward the request to the first among the top N nodes in the preference list.

Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key's preference list are accessed. When there are node failures or network partitions, nodes that are lower ranked in the preference list are accessed.

To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems. This protocol has two key configurable values: R and W. R is the minimum number of nodes that must participate in a successful read operation. W is the minimum number of nodes that must participate in a successful write operation. Setting R and W such that R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.

Upon receiving a put() request for a key, the coordinator generates the vector clock for the new version and writes the new version locally. The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes. If at least W-1 nodes respond then the write is considered successful.

Similarly, for a get() request, the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client. If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
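The quorum interaction can be sketched with toy in-memory dictionaries standing in for the N highest-ranked reachable replicas. The sketch is self-contained and deliberately simplified: requests are issued sequentially rather than in parallel, a random drop stands in for an unreachable replica, and version reconciliation is omitted.

    import random

    N, R, W = 3, 2, 2                             # R + W > N gives quorum-like behavior
    replica_stores = [dict() for _ in range(N)]   # toy stand-ins for the top N nodes

    def coordinate_put(key, value, version):
        # The coordinator writes locally, then needs W-1 further acknowledgements
        # from the other N-1 highest-ranked reachable replicas.
        replica_stores[0][key] = (value, version)
        acks = 0
        for store in replica_stores[1:]:
            if random.random() > 0.1:             # pretend a replica may miss the write
                store[key] = (value, version)
                acks += 1
        return acks >= W - 1

    def coordinate_get(key):
        # Ask all N replicas and succeed once R have answered; every causally
        # unrelated version found would be returned to the client.
        responses = [store[key] for store in replica_stores if key in store]
        if len(responses) < R:
            raise RuntimeError("fewer than R replicas responded")
        return responses

    print(coordinate_put("cart:42", ["saw", "hammer"], version={"Sx": 1}))
    print(coordinate_get("cart:42"))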
4.6 Handling Failures: Hinted Handoff
If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To remedy this it does not enforce strict quorum membership; instead it uses a "sloppy quorum": all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.

Consider the example of the Dynamo configuration given in Figure 2 with N=3. In this example, if node A is temporarily down or unreachable during a write operation then a replica that would normally have lived on A will now be sent to node D. This is done to maintain the desired availability and durability guarantees. The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A). Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A. Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.

Using hinted handoff, Dynamo ensures that read and write operations do not fail due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key to its local store. Thus, the write request is only rejected if all nodes in the system are unavailable. However, in practice, most Amazon services in production set a higher W to meet the desired level of durability. A more detailed discussion of configuring N, R and W follows in Section 6.

It is imperative that a highly available storage system be capable of handling the failure of an entire data center(s). Data center failures happen due to power outages, cooling failures, network failures, and natural disasters. Dynamo is configured such that each object is replicated across multiple data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. These data centers are connected through high-speed network links. This scheme of replicating across multiple data centers allows us to handle entire data center failures without a data outage.
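The sketch below illustrates sloppy quorum writes and hinted handoff with toy in-memory nodes. Node names follow the Figure 2 example; the data structures and helper names are assumptions for illustration, not Dynamo's implementation.

    class StorageNode:
        def __init__(self, name):
            self.name = name
            self.up = True
            self.store = {}      # regular replicas
            self.hinted = []     # (intended_node, key, value), kept in a separate local store

    def sloppy_write(key, value, preference_list, n):
        # Write to the first n healthy nodes; a node standing in for a down,
        # higher-ranked node keeps the replica with a hint naming that node.
        targets = [node for node in preference_list if node.up][:n]
        missing = [node for node in preference_list[:n] if not node.up]
        for node in targets:
            if node in preference_list[:n]:
                node.store[key] = value
            else:
                node.hinted.append((missing.pop(0), key, value))

    def deliver_hints(node):
        # Periodic scan: hand each hinted replica back once its intended
        # recipient is reachable again, then drop the local copy.
        pending = []
        for intended, key, value in node.hinted:
            if intended.up:
                intended.store[key] = value
            else:
                pending.append((intended, key, value))
        node.hinted = pending

    A, B, C, D = (StorageNode(x) for x in "ABCD")
    A.up = False
    sloppy_write("k", "v", [A, B, C, D], n=3)   # D holds a replica hinted for A
    A.up = True
    deliver_hints(D)                            # the replica is transferred back to A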
4.7 Handling permanent failures: Replica synchronization
Hinted handoff works best if the system membership churn is low and node failures are transient. There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. To handle this and other threats to durability, Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized.

To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees [13]. A Merkle tree is a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas. For instance, if the hash values of the roots of two trees are equal, then the values of the leaf nodes in the trees are equal and the nodes require no synchronization. If not, it implies that the values of some replicas are different. In such cases, the nodes may exchange the hash values of children and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are "out of sync". Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.

Dynamo uses Merkle trees for anti-entropy as follows: each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the tree traversal scheme described above, the nodes determine if they have any differences and perform the appropriate synchronization action. The disadvantage with this scheme is that many key ranges change when a node joins or leaves the system, thereby requiring the tree(s) to be recalculated. This issue is addressed, however, by the refined partitioning scheme described in Section 6.2.
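A small sketch of the comparison idea, assuming a binary Merkle tree built with SHA-256 over a sorted list of (key, value) pairs and two replicas that hold the same key set (only values may differ). Both the layout and the hash choice are assumptions for illustration.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()   # hash choice is an assumption

    def build_merkle(items):
        # items: non-empty, sorted list of (key, value) byte pairs for one key range.
        if len(items) == 1:
            key, value = items[0]
            return {"hash": h(key + value), "key": key}   # leaf = hash of one key's value
        mid = len(items) // 2
        left, right = build_merkle(items[:mid]), build_merkle(items[mid:])
        return {"hash": h(left["hash"] + right["hash"]), "left": left, "right": right}

    def out_of_sync(a, b, keys):
        # Compare top-down, descending only into branches whose hashes differ;
        # equal subtree hashes mean that branch needs no further exchange.
        if a["hash"] == b["hash"]:
            return
        if "key" in a:
            keys.append(a["key"])
            return
        out_of_sync(a["left"], b["left"], keys)
        out_of_sync(a["right"], b["right"], keys)

    mine = build_merkle([(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"v3"), (b"k4", b"v4")])
    peer = build_merkle([(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"STALE"), (b"k4", b"v4")])
    diverged = []
    out_of_sync(mine, peer, diverged)
    print(diverged)   # [b'k3']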
4.8 Membership and Failure Detection

4.8.1 Ring Membership
In Amazon's environment node outages (due to failures and maintenance tasks) are often transient but may last for extended intervals. A node outage rarely signifies a permanent departure and therefore should not result in rebalancing of the partition assignment or repair of the unreachable replicas. Similarly, manual error could result in the unintentional startup of new Dynamo nodes. For these reasons, it was deemed appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring. An administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership change to join a node to a ring or remove a node from a ring. The node that serves the request writes the membership change and its time of issue to persistent store. The membership changes form a history because nodes can be removed and added back multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories.
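The gossip step can be pictured as each node periodically merging its persisted change history with that of a randomly chosen peer. The sketch below is illustrative only; representing each change as a (time of issue, change) pair and merging by set union are assumptions standing in for the reconciliation the text leaves unspecified.

    import random

    class MemberNode:
        def __init__(self, name):
            self.name = name
            self.history = set()   # persisted membership changes: {(time_of_issue, change), ...}

        def gossip_once(self, peers):
            # Every second, pick a random peer and reconcile change histories;
            # repeated rounds give an eventually consistent view of membership.
            peer = random.choice(peers)
            merged = self.history | peer.history
            self.history = peer.history = merged

    nodes = [MemberNode(name) for name in "ABCD"]
    nodes[0].history.add((1, "add node E"))   # an administrator issued a join at node A
    for _ in range(10):                       # a few gossip rounds spread the change
        for node in nodes:
            node.gossip_once([p for p in nodes if p is not node])
    print(all((1, "add node E") in n.history for n in nodes))   # True (after enough rounds)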
When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets. The mapping is persisted on disk and