Dynamodb Part 4

The document discusses Dynamo, a highly available and scalable data store used by Amazon, focusing on its trade-offs between consistency and availability, particularly in the context of divergent versions arising from failures and concurrent writes. It highlights the efficiency of different strategies for load balancing and request coordination, with client-driven coordination showing significant latency improvements over server-driven approaches. The paper concludes that Dynamo's design allows for high availability and performance while enabling service owners to customize their storage systems based on specific needs.

[Figure 8 omitted: a plot of load balancing efficiency (mean load / max load) on the y-axis against the size of metadata maintained at each node (in abstract units) on the x-axis, with one curve each for Strategy 1, Strategy 2, and Strategy 3.]

Figure 8: Comparison of the load distribution efficiency of different strategies for a system with 30 nodes and N=3 with an equal amount of metadata maintained at each node. The values of the system size and number of replicas are based on the typical configuration deployed for the majority of our services.

evaluate the skew in their load distribution while all strategies use the same amount of space to maintain their membership information. For instance, in Strategy 1 each node needs to maintain the token positions of all the nodes in the ring, and in Strategy 3 each node needs to maintain the information regarding the partitions assigned to each node.

In our next experiment, these strategies were evaluated by varying the relevant parameters (T and Q). The load balancing efficiency of each strategy was measured for different sizes of membership information maintained at each node, where load balancing efficiency is defined as the ratio of the average number of requests served by each node to the maximum number of requests served by the hottest node.

The results are given in Figure 8. As seen in the figure, Strategy 3 achieves the best load balancing efficiency and Strategy 2 has the worst. For a brief time, Strategy 2 served as an interim setup during the process of migrating Dynamo instances from Strategy 1 to Strategy 3. Compared to Strategy 1, Strategy 3 achieves better efficiency and reduces the size of membership information maintained at each node by three orders of magnitude. While storage is not a major issue, the nodes gossip the membership information periodically, so it is desirable to keep this information as compact as possible. In addition, Strategy 3 is advantageous and simpler to deploy for the following reasons: (i) Faster bootstrapping/recovery: since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding the random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery. (ii) Ease of archival: periodic archiving of the dataset is a mandatory requirement for most Amazon storage services. Archiving the entire dataset stored by Dynamo is simpler in Strategy 3 because the partition files can be archived separately. By contrast, in Strategy 1 the tokens are chosen randomly, and archiving the data stored in Dynamo requires retrieving the keys from individual nodes separately, which is usually inefficient and slow. The disadvantage of Strategy 3 is that changing the node membership requires coordination in order to preserve the properties required of the assignment.

6.3 Divergent Versions: When and How Many?

As noted earlier, Dynamo is designed to trade off consistency for availability. To understand the precise impact of different failures on consistency, detailed data is required on multiple factors: outage length, type of failure, component reliability, workload, etc. Presenting these numbers in detail is outside the scope of this paper. However, this section discusses a good summary metric: the number of divergent versions seen by the application in a live production environment.

Divergent versions of a data item arise in two scenarios. The first is when the system is facing failure scenarios such as node failures, data center failures, and network partitions. The second is when the system is handling a large number of concurrent writers to a single data item and multiple nodes end up coordinating the updates concurrently. From both a usability and an efficiency perspective, it is preferred to keep the number of divergent versions at any given time as low as possible. If the versions cannot be syntactically reconciled based on vector clocks alone, they have to be passed to the business logic for semantic reconciliation. Semantic reconciliation introduces additional load on services, so it is desirable to minimize the need for it.

In our next experiment, the number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.

Experience shows that increases in the number of divergent versions are contributed not by failures but by increases in the number of concurrent writers. Such increases in concurrent writes are usually triggered by busy robots (automated client programs) and rarely by humans. This issue is not discussed in detail due to the sensitive nature of the story.
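For illustration, the following is a minimal Python sketch of the syntactic reconciliation step described above: versions whose vector clocks are causally dominated by another version's clock can be discarded, and only causally concurrent versions need to be handed to the application for semantic reconciliation. The function names and the representation of clocks as node-to-counter dictionaries are assumptions made for this sketch, not the paper's implementation.

    def descends(a, b):
        # True if clock 'a' causally descends from (dominates or equals) clock 'b'.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def syntactic_reconcile(versions):
        # versions: list of (value, clock) pairs returned by a read, where each
        # clock is a dict mapping a coordinator node id to its update counter.
        survivors = []
        for i, (value, clock) in enumerate(versions):
            subsumed = any(
                j != i and descends(other, clock) and other != clock
                for j, (_, other) in enumerate(versions)
            )
            if not subsumed:
                survivors.append((value, clock))
        # More than one survivor means the versions are causally concurrent and
        # must be reconciled semantically (e.g., merging shopping cart contents).
        return survivors

For example, a version with clock {"A": 2, "B": 1} subsumes one with clock {"A": 1, "B": 1}, but is concurrent with one whose clock is {"A": 1, "B": 2}; in the latter case both versions survive and the business logic must merge them.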
6.4 Client-driven or Server-driven Coordination

As mentioned in Section 5, Dynamo has a request coordination component that uses a state machine to handle incoming requests. Client requests are uniformly assigned to nodes in the ring by a load balancer. Any Dynamo node can act as a coordinator for a read request. Write requests, on the other hand, will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.

An alternative approach to request coordination is to move the state machine to the client nodes. In this scheme, client applications use a library to perform request coordination locally. A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key. Read requests can be coordinated at the client node, thereby avoiding the extra network hop that is incurred if the request were assigned to a random Dynamo node by the load balancer. Writes will either be forwarded to a node in the key's preference list or can be coordinated locally if Dynamo is using timestamp-based versioning.
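To make the client-driven scheme concrete, here is a minimal Python sketch of what such a coordination library could look like. The class name, the fetch_membership and read_from_replicas helpers, and the MD5-based ring lookup are illustrative assumptions, not Amazon's actual client library; the 10-second refresh interval and N=3 default mirror the values discussed in this section.

    import bisect, hashlib, random, time

    def fetch_membership(node):
        # Assumed RPC helper: returns the node's current view of the ring as a
        # list of (token, node_address) pairs.
        raise NotImplementedError

    def read_from_replicas(nodes, key):
        # Assumed helper: issues the read to the replicas and reconciles replies.
        raise NotImplementedError

    class DynamoClient:
        def __init__(self, seed_nodes, n_replicas=3, refresh_interval=10.0):
            self.seed_nodes = list(seed_nodes)   # addresses used to bootstrap
            self.n = n_replicas                  # N, replicas per key
            self.refresh_interval = refresh_interval
            self.ring = []                       # sorted list of (token, node)
            self.last_refresh = 0.0

        def _maybe_refresh(self):
            # Pull-based membership: periodically fetch the ring view from a
            # randomly chosen Dynamo node.
            if not self.ring or time.time() - self.last_refresh >= self.refresh_interval:
                self.ring = sorted(fetch_membership(random.choice(self.seed_nodes)))
                self.last_refresh = time.time()

        def preference_list(self, key):
            self._maybe_refresh()
            token = int(hashlib.md5(key.encode()).hexdigest(), 16)
            positions = [t for t, _ in self.ring]
            start = bisect.bisect_right(positions, token) % len(self.ring)
            nodes, i = [], start
            # Walk clockwise around the ring, collecting N distinct nodes.
            while len(nodes) < self.n and i < start + len(self.ring):
                candidate = self.ring[i % len(self.ring)][1]
                if candidate not in nodes:
                    nodes.append(candidate)
                i += 1
            return nodes

        def get(self, key):
            # Coordinate the read locally, avoiding the load balancer's extra hop.
            return read_from_replicas(self.preference_list(key), key)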
Table 2: Performance of client-driven and server-driven coordination approaches.

                   99.9th percentile   99.9th percentile   Average read    Average write
                   read latency (ms)   write latency (ms)  latency (ms)    latency (ms)
    Server-driven        68.9                68.5               3.9             4.02
    Client-driven        30.4                30.4               1.55            1.9

An important advantage of the client-driven coordination approach is that a load balancer is no longer required to uniformly distribute client load. Fair load distribution is implicitly guaranteed by the near-uniform assignment of keys to the storage nodes. Obviously, the efficiency of this scheme depends on how fresh the membership information is at the client. Currently clients poll a random Dynamo node every 10 seconds for membership updates. A pull-based approach was chosen over a push-based one because the former scales better with a large number of clients and requires very little state to be maintained at the servers regarding clients. However, in the worst case the client can be exposed to stale membership for a duration of 10 seconds. If a client detects that its membership table is stale (for instance, when some members are unreachable), it will immediately refresh its membership information.

Table 2 shows the latency improvements at the 99.9th percentile and on average that were observed over a period of 24 hours using client-driven coordination compared to the server-driven approach. As seen in the table, the client-driven coordination approach reduces the latencies by at least 30 milliseconds at the 99.9th percentile and decreases the averages by 3 to 4 milliseconds. The latency improvement is because the client-driven approach eliminates the overhead of the load balancer and the extra network hop that may be incurred when a request is assigned to a random node. As seen in the table, average latencies tend to be significantly lower than latencies at the 99.9th percentile. This is because Dynamo's storage engine caches and write buffer have good hit ratios. Moreover, since the load balancers and network introduce additional variability into the response time, the gain in response time is higher at the 99.9th percentile than on average.

6.5 Balancing background vs. foreground tasks

Each node performs different kinds of background tasks for replica synchronization and data handoff (either due to hinting or adding/removing nodes) in addition to its normal foreground put/get operations. In early production settings, these background tasks triggered the problem of resource contention and affected the performance of the regular put and get operations. Hence, it became necessary to ensure that background tasks ran only when the regular critical operations were not affected significantly. To this end, the background tasks were integrated with an admission control mechanism. Each of the background tasks uses this controller to reserve runtime slices of the resource (e.g., the database) shared across all background tasks. A feedback mechanism based on the monitored performance of the foreground tasks is employed to change the number of slices that are available to the background tasks.

The admission controller constantly monitors the behavior of resource accesses while executing a "foreground" put/get operation. Monitored aspects include latencies for disk operations, failed database accesses due to lock contention and transaction timeouts, and request queue wait times. This information is used to check whether the percentiles of latencies (or failures) in a given trailing time window are close to a desired threshold. For example, the background controller checks how close the 99th percentile database read latency (over the last 60 seconds) is to a preset threshold (say 50 ms). The controller uses such comparisons to assess the resource availability for the foreground operations. Subsequently, it decides how many time slices will be available to background tasks, thereby using the feedback loop to limit the intrusiveness of the background activities. Note that a similar problem of managing background tasks has been studied in [4].
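A minimal Python sketch of the feedback loop just described follows. The 50 ms threshold and 60-second trailing window are the example values given above; the class interface and the way the slice budget is scaled are hypothetical simplifications, not the production controller.

    import time
    from collections import deque

    class AdmissionController:
        def __init__(self, threshold_ms=50.0, window_s=60.0, max_slices=10):
            self.threshold_ms = threshold_ms   # target for the 99th percentile latency
            self.window_s = window_s           # trailing observation window
            self.max_slices = max_slices       # slices granted when foreground is idle
            self.samples = deque()             # (timestamp, latency_ms) of foreground ops
            self.slices = max_slices

        def record_foreground_latency(self, latency_ms):
            now = time.time()
            self.samples.append((now, latency_ms))
            # Drop samples that have fallen out of the trailing window.
            while self.samples and now - self.samples[0][0] > self.window_s:
                self.samples.popleft()

        def _p99(self):
            latencies = sorted(lat for _, lat in self.samples)
            return latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0

        def update_slices(self):
            # Shrink the background budget as the observed 99th percentile
            # approaches the threshold; restore it when there is headroom.
            headroom = 1.0 - min(self._p99() / self.threshold_ms, 1.0)
            self.slices = max(0, int(headroom * self.max_slices))
            return self.slices

        def admit_background(self):
            # A background task requests a slice before touching the shared resource.
            if self.slices > 0:
                self.slices -= 1
                return True
            return False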
6.6 Discussion

This section summarizes some of the experiences gained during the process of implementing and maintaining Dynamo. Many Amazon internal services have used Dynamo for the past two years and it has provided significant levels of availability to its applications. In particular, applications have received successful responses (without timing out) for 99.9995% of their requests, and no data loss event has occurred to date.

Moreover, the primary advantage of Dynamo is that it provides the necessary knobs, in the form of the three parameters (N, R, W), for service owners to tune their instance based on their needs. Unlike popular commercial data stores, Dynamo exposes data consistency and reconciliation logic issues to the developers. At the outset, one may expect the application logic to become more complex. However, historically, Amazon's platform is built for high availability and many applications are designed to handle different failure modes and inconsistencies that may arise. Hence, porting such applications to use Dynamo was a relatively simple task. For new applications that want to use Dynamo, some analysis is required during the initial stages of development to pick the right conflict resolution mechanisms that meet the business case appropriately.
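To illustrate the (N, R, W) knobs mentioned above, here is a small hypothetical configuration sketch in Python. The QuorumConfig type and the example profiles are illustrative only, although the (3, 2, 2) setting reflects the configuration the paper describes as common across Dynamo instances.

    from dataclasses import dataclass

    @dataclass
    class QuorumConfig:
        n: int  # number of nodes that store each key
        r: int  # minimum replicas that must participate in a successful read
        w: int  # minimum replicas that must acknowledge a write

        def overlapping_quorums(self) -> bool:
            # R + W > N guarantees that read and write quorums intersect, so a
            # read observes at least one replica with the latest acknowledged write.
            return self.r + self.w > self.n

    # Configuration the paper reports as common across Dynamo instances:
    standard = QuorumConfig(n=3, r=2, w=2)
    # A hypothetical "always writeable" profile that favors write availability and
    # latency over consistency; more divergent versions may reach the application.
    always_writeable = QuorumConfig(n=3, r=1, w=1)

    assert standard.overlapping_quorums() and not always_writeable.overlapping_quorums()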
Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with the other nodes in the system. This model works well for a system that contains a couple of hundred nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead of maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems (e.g., [14]).
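A simplified Python sketch of the full-membership gossip described above is given below; the class shape and the versioned routing-table entries are assumptions made for illustration, and Dynamo's actual reconciliation of membership state is more involved. The sketch makes the scaling concern visible: every node carries an entry for every other node, so per-node state and gossip payloads grow linearly with the system size.

    import random

    class GossipingNode:
        def __init__(self, name, token):
            self.name = name
            # Routing table: node name -> (token, version); a higher version
            # wins when two views of the same node are merged.
            self.routing_table = {name: (token, 1)}

        def merge(self, remote_table):
            for node, (token, version) in remote_table.items():
                local = self.routing_table.get(node)
                if local is None or version > local[1]:
                    self.routing_table[node] = (token, version)

        def gossip_round(self, peers):
            # Exchange full routing tables with one randomly chosen peer (push-pull).
            peer = random.choice(peers)
            peer.merge(self.routing_table)
            self.merge(peer.routing_table)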
7. CONCLUSIONS

This paper described Dynamo, a highly available and scalable data store used for storing the state of a number of core services of Amazon.com's e-commerce platform. Dynamo has provided the desired levels of availability and performance and has been successful in handling server failures, data center failures, and network partitions. Dynamo is incrementally scalable and allows service owners to scale up and down based on their current request load. Dynamo allows service owners to customize their storage system to meet their desired performance, durability, and consistency SLAs by allowing them to tune the parameters N, R, and W.

The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly available system. Its success in one of the most challenging application environments shows that an eventually-consistent storage system can be a building block for highly available applications.

ACKNOWLEDGEMENTS

The authors would like to thank Pat Helland for his contribution to the initial design of Dynamo. We would also like to thank Marvin Theimer and Robert van Renesse for their comments. Finally, we would like to thank our shepherd, Jeff Mogul, for his detailed comments and inputs while preparing the camera-ready version, which vastly improved the quality of the paper.
REFERENCES

[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken, R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and Wattenhofer, R. P. 2002. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 1-14.

[2] Bernstein, P.A., and Goodman, N. An algorithm for concurrency control and recovery in replicated distributed databases. ACM Trans. on Database Systems, 9(4):596-615, December 1984.

[3] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (Seattle, WA, November 06 - 08, 2006). USENIX Association, Berkeley, CA, 15-15.

[4] Douceur, J. R. and Bolosky, W. J. 2000. Process-based regulation of low-importance processes. SIGOPS Oper. Syst. Rev. 34, 2 (Apr. 2000), 26-27.

[5] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. 1997. Cluster-based scalable network services. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (Saint Malo, France, October 05 - 08, 1997). W. M. Waite, Ed. SOSP '97. ACM Press, New York, NY, 78-91.

[6] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19 - 22, 2003). SOSP '03. ACM Press, New York, NY, 29-43.

[7] Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (Montreal, Quebec, Canada, June 04 - 06, 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY, 173-182.

[8] Gupta, I., Chandra, T. D., and Goldszmidt, G. S. 2001. On scalable and efficient distributed failure detectors. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing (Newport, Rhode Island, United States). PODC '01. ACM Press, New York, NY, 170-179.

[9] Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Wells, C., and Zhao, B. 2000. OceanStore: an architecture for global-scale persistent storage. SIGARCH Comput. Archit. News 28, 5 (Dec. 2000), 190-201.

[10] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04 - 06, 1997). STOC '97. ACM Press, New York, NY, 654-663.

[11] Lindsay, B.G., et al. "Notes on Distributed Databases", Research Report RJ2571(33471), IBM Research, July 1979.

[12] Lamport, L. Time, clocks, and the ordering of events in a distributed system. ACM Communications, 21(7), pp. 558-565, 1978.

[13] Merkle, R. A digital signature based on a conventional encryption function. Proceedings of CRYPTO, pages 369-378. Springer-Verlag, 1988.

[14] Ramasubramanian, V., and Sirer, E. G. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, San Francisco, CA, March 29 - 31, 2004.

[15] Reiher, P., Heidemann, J., Ratner, D., Skinner, G., and Popek, G. 1994. Resolving file conflicts in the Ficus file system. In Proceedings of the USENIX Summer 1994 Technical Conference - Volume 1 (Boston, Massachusetts, June 06 - 10, 1994). USENIX Association, Berkeley, CA, 12-12.

[16] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. Proceedings of Middleware, pages 329-350, November 2001.

[17] Rowstron, A., and Druschel, P. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. Proceedings of Symposium on Operating Systems Principles, October 2001.

[18] Saito, Y., Frølund, S., Veitch, A., Merchant, A., and Spence, S. 2004. FAB: building distributed enterprise disk arrays from commodity components. SIGOPS Oper. Syst. Rev. 38, 5 (Dec. 2004), 48-58.

[19] Satyanarayanan, M., Kistler, J.J., Siegel, E.H. Coda: A Resilient Distributed File System. IEEE Workshop on Workstation Operating Systems, Nov. 1987.

[20] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. 2001. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (San Diego, California, United States). SIGCOMM '01. ACM Press, New York, NY, 149-160.

[21] Terry, D. B., Theimer, M. M., Petersen, K., Demers, A. J., Spreitzer, M. J., and Hauser, C. H. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (Copper Mountain, Colorado, United States, December 03 - 06, 1995). M. B. Jones, Ed. SOSP '95. ACM Press, New York, NY, 172-182.

[22] Thomas, R. H. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems 4 (2): 180-209, 1979.

[23] Weatherspoon, H., Eaton, P., Chun, B., and Kubiatowicz, J. 2007. Antiquity: exploiting a secure log for wide-area distributed storage. SIGOPS Oper. Syst. Rev. 41, 3 (Jun. 2007), 371-384.

[24] Welsh, M., Culler, D., and Brewer, E. 2001. SEDA: an architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (Banff, Alberta, Canada, October 21 - 24, 2001). SOSP '01. ACM Press, New York, NY, 230-243.
