Consistent Hashing and Random Trees:
Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

David Karger 1   Eric Lehman 1   Tom Leighton 1,2   Matthew Levine 1   Daniel Lewin 1   Rina Panigrahy 1
Abstract

We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows.

Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.

This research was supported in part by DARPA contracts N00014-95-1-1246 and DABT63-95-C-0009, Army Contract DAAH04-95-1-0607, and NSF contract CCR-9624239.
1 Laboratory for Computer Science, MIT, Cambridge, MA 02139. Email: {karger,e_lehman,ftl,mslevine,danl,rinap}@[Link]
2 Department of Mathematics, MIT, Cambridge, MA 02139.
A full version of this paper is available at: [Link]

1 Introduction

In this paper, we describe caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of "hot spots". Hot spots occur any time a large number of clients wish to simultaneously access data from a single server. If the site is not provisioned to deal with all of these clients simultaneously, service may be degraded or lost.

Many of us have experienced the hot spot phenomenon in the context of the Web. A Web site can suddenly become extremely popular and receive far more requests in a relatively short time than it was originally configured to handle. In fact, a site may receive so many requests that it becomes "swamped," which typically renders it unusable. Besides making the one site inaccessible, heavy traffic destined to one location can congest the network near it, interfering with traffic at nearby sites.

As use of the Web has increased, so has the occurrence and impact of hot spots. Recent famous examples of hot spots on the Web include the JPL site after the Shoemaker-Levy 9 comet struck Jupiter, an IBM site during the Deep Blue-Kasparov chess tournament, and several political sites on the night of the election. In some of these cases, users were denied access to a site for hours or even days. Other examples include sites identified as "Web-site-of-the-day" and sites that provide new versions of popular software.

Our work was originally motivated by the problem of hot spots on the World Wide Web. We believe the tools we develop may be relevant to many client-server models, because centralized servers on the Internet such as Domain Name servers, Multicast servers, and Content Label servers are also susceptible to hot spots.

1.1 Past Work

Several approaches to overcoming hot spots have been proposed. Most use some kind of replication strategy to store copies of hot pages throughout the Internet; this spreads the work of serving a hot page across several servers. In one approach, already in wide use, several clients share a proxy cache. All user requests are forwarded through the proxy, which tries to keep copies of frequently requested pages. It tries to satisfy requests with a cached copy; failing this, it forwards the request to the home server. The dilemma in this scheme is that there is more benefit if more users share the same cache, but then the cache itself is liable to get swamped.

Malpani et al. [6] work around this problem by making a group of caches function as one. A user's request for a page is directed to an arbitrary cache. If the page is stored there, it is returned to the user. Otherwise, the cache forwards the request to all other caches via a special protocol called "IP Multicast". If the page is cached nowhere, the request is forwarded to the home site of the page. The disadvantage of this technique is that as the number of participating caches grows, even with the use of multicast, the number of messages between caches can become unmanageable. A tool that we develop in this paper, consistent hashing, gives a way to implement such a distributed cache without requiring that the caches communicate all the time. We discuss this in Section 4.
Chankhunthod et al. [1] developed the Harvest Cache, a more
scalable approach using a tree of caches. A user obtains a page by
asking a nearby leaf cache. If neither this cache nor its siblings have
the page, the request is forwarded to the cache's parent. If a page
is stored by no cache in the tree, the request eventually reaches
the root and is forwarded to the home site of the page. A cache
retains a copy of any page it obtains for some time. The advantage of a cache tree is that a cache receives page requests only from its children (and siblings), ensuring that not too many requests arrive simultaneously. Thus, many requests for a page in a short period of time will only cause one request to the home server of the page, and won't overload the caches either. A disadvantage, at least in theory, is that the same tree is used for all pages, meaning that the root receives at least one request for every distinct page requested of the entire cache tree. This can swamp the root if the number of distinct page requests grows too large, meaning that this scheme also suffers from potential scaling problems.

Plaxton and Rajaraman [9] show how to balance the load among all caches by using randomization and hashing. In particular, they use a hierarchy of progressively larger sets of "virtual cache sites" for each page and use a random hash function to assign responsibility for each virtual site to an actual cache in the network. Clients send a request to a random element in each set in the hierarchy. Caches assigned to a given set copy the page to some members of the next, larger set when they discover that their load is too heavy. This gives fast responses even for popular pages, because the largest set that has the page is not overloaded. It also gives good load balancing, because a machine in a small (thus loaded) set for one page is likely to be in a large (thus unloaded) set for another. Plaxton and Rajaraman's technique is also fault tolerant.

The Plaxton/Rajaraman algorithm has drawbacks, however. For example, since their algorithm sends a copy of each page request to a random element in every set, the small sets for a popular page are guaranteed to be swamped. In fact, the algorithm uses swamping as a feature since swamping is used to trigger replication. This works well in their model of a synchronous parallel system, where a swamped processor is assumed to receive a subset of the incoming messages, but otherwise continues to function normally. On the Internet, however, swamping has much more serious consequences. Swamped machines cannot be relied upon to recover quickly and may even crash. Moreover, the intentional swamping of large numbers of random machines could well be viewed unfavorably by the owners of those machines. The Plaxton/Rajaraman algorithm also requires that all communications be synchronous and/or that messages have priorities, and that the set of caches available be fixed and known to all users.

1.2 Our Contribution

Here, we describe two tools for data replication and use them to give a caching algorithm that overcomes the drawbacks of the preceding approaches and has several additional, desirable properties.

Our first tool, random cache trees, combines aspects of the structures used by Chankhunthod et al. and Plaxton/Rajaraman. Like Chankhunthod et al., we use a tree of caches to coalesce requests. Like Plaxton and Rajaraman, we balance load by using a different tree for each page and assigning tree nodes to caches via a random hash function. By combining the best features of Chankhunthod et al. and Plaxton/Rajaraman with our own methods, we prevent any server from becoming swamped with high probability, a property not possessed by either Chankhunthod et al. or Plaxton/Rajaraman. In addition, our protocol shows how to minimize memory requirements (without significantly increasing cache miss rates) by only caching pages that have been requested a sufficient number of times.

We believe that the extra delay introduced by a tree of caches should be quite small in practice. The time to request a page is multiplied by the tree depth. However, the page request typically takes so little time that the extra delay is not great. The return of a page can be pipelined; a cache need not wait until it receives a whole page before sending data to its child in the tree. Therefore, the return of a page also takes only slightly longer. Altogether, the added delay seen by a user is small.

Our second tool is a new hashing scheme we call consistent hashing. This hashing scheme differs substantially from that used in Plaxton/Rajaraman and other practical systems. Typical hashing-based schemes do a good job of spreading load through a known, fixed collection of servers. The Internet, however, does not have a fixed collection of machines. Instead, machines come and go as they crash or are brought into the network. Even worse, the information about what machines are functional propagates slowly through the network, so that clients may have incompatible "views" of which machines are available to replicate data. This makes standard hashing useless, since it relies on clients agreeing on which caches are responsible for serving a particular page. For example, Feeley et al. [3] implement a distributed global shared memory system for a network of workstations that uses a hash table distributed among the machines to resolve references. Each time a new machine joins the network, they require a central server to redistribute a completely updated hash table to all the machines.

Consistent hashing may help solve such problems. Like most hashing schemes, consistent hashing assigns a set of items to buckets so that each bucket receives roughly the same number of items. Unlike standard hashing schemes, a small change in the bucket set does not induce a total remapping of items to buckets. In addition, hashing items into slightly different sets of buckets gives only slightly different assignments of items to buckets. We apply consistent hashing to our tree-of-caches scheme, and show how this makes the scheme work well even if each client is aware of only a constant fraction of all the caching machines. In [5], Litwin et al. propose a hash function that allows buckets to be added one at a time sequentially. However, our hash function allows the buckets to be added in an arbitrary order. Another scheme that we can improve on is given by Devine [2]. In addition, we believe that consistent hashing will be useful in other applications (such as quorum systems [7, 8] or distributed name servers) where multiple machines with different views of the network must agree on a common storage location for an object without communication.

1.3 Presentation

In Section 2 we describe our model of the Web and the hot spot problem. Our model is necessarily simplistic, but is rich enough to develop and analyze protocols that we believe may be useful in practice. In Section 3, we describe our random tree method and use it in a caching protocol that effectively eliminates hot spots under a simplified model. Independent of Section 3, in Section 4 we present our consistent hashing method and use it to solve hot spots under a different simplified model involving inconsistent views.

In Section 5 we show how our two techniques can be effectively combined. In Section 6 we propose a simple delay model that captures hierarchical clustering of machines on the Internet. We show that our protocol can be easily extended to work in this more realistic delay model. In Sections 7 and 8 we consider faults and the behavior of the protocol over time, respectively. In Section 9 we discuss some extensions and open problems.

1.4 A Note on Randomization and Hashing

In several places we make use of hash functions that map objects into a range. For clarity we assume that these functions map objects in a truly random fashion, i.e. uniformly and independently. In practice, hash functions with limited independence are more plausible, since they economize on space and randomness. We have proven all theorems of this paper with only limited independence, using methods similar to those in [11]. However, in this extended abstract we only state the degree of independence required for results to hold. Proofs assuming limited independence will appear in the full version of this paper.
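As a concrete reference point (our illustration, not part of the paper), a standard way to realize a $k$-universal family of the kind referred to above is to evaluate a random polynomial of degree $k-1$ over a prime field; any $k$ distinct inputs then hash independently:

```python
# A standard k-universal hash family (illustrative sketch; the paper only
# assumes such a family exists). h(x) evaluates a random degree-(k-1)
# polynomial over Z_p, then reduces to the desired range (approximately
# uniform when P is much larger than num_buckets).
import random

P = (1 << 61) - 1  # a Mersenne prime comfortably above the input universe

def make_k_universal(k: int, num_buckets: int):
    coeffs = [random.randrange(P) for _ in range(k)]  # k random coefficients
    def h(x: int) -> int:
        acc = 0
        for c in coeffs:          # Horner evaluation of the polynomial
            acc = (acc * x + c) % P
        return acc % num_buckets
    return h

h = make_k_universal(k=8, num_buckets=100)
print(h(12345), h(67890))
```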
2 Model

This section presents our model of the Web and the hot spot problem.

We classify computers on the Web into three categories. All requests for Web pages are initiated by browsers. The permanent homes of Web pages are servers. Caches are extra machines which we use to protect servers from the barrage of browser requests. Throughout the paper, the set of caches is $\mathcal{C}$ and the number of caches is $C = |\mathcal{C}|$.

Each server is home to a fixed set of pages. Caches are also able to store a number of pages, but this set may change over time as dictated by a caching protocol. We generally assume that the content of each page is unchanging, though Section 9 contains a discussion of this issue. The set of all pages is denoted $\mathcal{P}$.

Any machine can send a message directly to any other, with the restriction that a machine may not be aware of the existence of all caches; we require only that each machine is aware of a $1/t$ fraction of the caches for some constant $t$. The two typical types of messages are requests for pages and the pages themselves. A machine which receives too many messages too quickly ceases to function properly and is said to be "swamped".

Latency measures the time for a message from machine $m_1$ to arrive at machine $m_2$. We denote this quantity $\delta(m_1, m_2)$. In practice, of course, delays on the Internet are not so simply characterized. The value of $\delta$ should be regarded as a "best guess" that we optimize on for lack of better information; the correctness of a protocol should not depend on values of $\delta$ (which could actually measure anything, such as throughput, price of connection, or congestion) being exactly accurate. Note that we do not make latency a function of message size; this issue is discussed in Section 3.2.1.

All cache and server behavior and some browser behavior is specified in our protocol. In particular, the protocol specifies how caches and servers respond to page requests and which pages are stored in a cache. The protocol also specifies the cache or server to which a browser sends each page request. All control must be local; the behavior of a machine can depend only on messages it receives.

An adversary decides which pages are requested by browsers. However, the adversary cannot see random values generated in our protocol and cannot adapt his requests based on observed delays in obtaining pages. We consider two models. First, we consider a static model in which a single "batch" of requests is processed, and require that the number of page requests be at most $R = \rho C$, where $\rho$ is a constant and $C$ is the number of caches. We then consider a temporal model in which the adversary may initiate new requests for pages at rate at most $\rho C$; that is, in any time interval $\tau > 1$, he may initiate at most $\rho C \tau$ requests.

Objective

The "hot spot problem" is to satisfy all browser page requests while ensuring that with high probability no cache or server is swamped. The phrase "with high probability" means "with probability at least $1 - 1/N$", where $N$ is a confidence parameter used throughout the paper.

While our basic requirement is to prevent swamping, we also have two additional objectives. The first is to minimize cache memory requirements. A protocol should work well without requiring any cache to store a large number of pages. A second objective is, naturally, to minimize the delay a browser experiences in obtaining a page.

3 Random Trees

In this section we introduce our first tool, random trees. To simplify the presentation, we give a simple caching protocol that would work well in a simpler world. In particular, we make the following simplifications to the model:

1. All machines know about all caches.

2. $\delta(m_i, m_j) = 1$ for all $i \neq j$.

3. All requests are made at the same time.

This restricted model is "static" in the sense that there is only one batch of requests; we need not consider the long-term stability of the network.

Under these restrictions we show a protocol that has good behavior. That is, with high probability no machine is swamped. We achieve a total delay of $\Theta(\log C)$ and prove that it is optimal. We use total cache space which is a fraction of the number of requests, and evenly divided among the caches. In subsequent sections we will show how to extend the protocol so as to preserve the good behavior without the simplifying assumptions.

The basic idea of our protocol is an extension of the "tree of caches" approach discussed in the introduction. We use this tree to ensure that no cache has many "children" asking it for a particular page. As discussed in the introduction, levels near the root get many requests for a page even if the page is relatively unpopular, so being the root for many pages causes swamping. Our technique, similar to Plaxton/Rajaraman's, is to use a different, randomly generated tree for each page. This ensures that no machine is near the root for many pages, thus providing good load balancing. Note that we cannot make use of the analysis given by Plaxton/Rajaraman, because our main concern is to prevent swamping, whereas they allow machines to be swamped.

In Section 3.1 below, we define our protocol precisely. In Section 3.2, we analyze the protocol, bounding the load on any cache, the storage each cache uses, and the delay a browser experiences before getting the page.

3.1 Protocol

We associate a rooted $d$-ary tree, called an abstract tree, with each page. We use the term nodes only in reference to the nodes of these abstract trees. The number of nodes in each tree is equal to the number of caches, and the tree is as balanced as possible (so all levels but the bottom are full). We refer to nodes of the tree by their rank in breadth-first search order. The protocol is described as running on these abstract trees; to support this, all requests for pages take the form of a 4-tuple consisting of the identity of the requester, the name of the desired page, a sequence of nodes through which the request should be directed, and a sequence of caches that should act as those nodes. To determine the latter sequence, that is, which cache actually does the work for a given node, the nodes are mapped to machines. The root of a tree is always mapped to the server for the page. All the other nodes are mapped to the caches by a hash function $h : \mathcal{P} \times [1 \ldots C] \to \mathcal{C}$, which must be distributed to all browsers and caches. In order not to create copies of pages for which there are few requests, we have another parameter, $q$, for how many requests a cache must see before it bothers to store a copy of the page.

Now, given a hash function $h$, and parameters $d$ and $q$, our protocol is as follows (a short illustrative sketch follows the three rules below):

Browser: When a browser wants a page, it picks a random leaf-to-root path, maps the nodes to machines with $h$, and asks the leaf node for the page. The request includes the name of the browser, the name of the page, the path, and the result of the mapping.

Cache: When a cache receives a request, it first checks to see if it is caching a copy of the page or is in the process of getting one to cache. If so, it returns the page to the requester (after it gets its copy, if necessary). Otherwise it increments a counter for the page and the node it is acting as, and asks the next machine on the path for the page. If the counter reaches $q$, it caches a copy of the page. In either case the cache passes the page on to the requester when it is obtained.

Server: When a server receives a request, it sends the requester a copy of the page.
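The following sketch (ours, not the authors' code) shows one way the rules above could fit together. NUM_CACHES, FANOUT_D, and THRESHOLD_Q are hypothetical stand-ins for $C$, $d$, and $q$; the shared hash $h$ is simulated with a cryptographic hash, and a real deployment would route over the network rather than use shared dictionaries:

```python
# Minimal, illustrative sketch of the Section 3.1 protocol under stated
# assumptions; not a definitive implementation.
import hashlib
import random
from collections import defaultdict

NUM_CACHES = 1000          # C: number of caches (assumed value)
FANOUT_D = 4               # d: abstract-tree fanout (assumed value)
THRESHOLD_Q = 3            # q: requests seen before a cache stores a copy

def h(page: str, node_rank: int) -> int:
    """Stand-in for the shared hash function h : P x [1..C] -> C."""
    digest = hashlib.sha256(f"{page}/{node_rank}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CACHES

def parent(rank: int) -> int:
    """Parent of a node in breadth-first rank order; the root has rank 1."""
    return (rank - 2) // FANOUT_D + 1

# Ranks with no children within 1..C are the leaves of the balanced tree.
FIRST_LEAF = next(r for r in range(1, NUM_CACHES + 1)
                  if FANOUT_D * (r - 1) + 2 > NUM_CACHES)

def random_leaf_to_root_path(page: str) -> list[int]:
    """Browser rule: random leaf of the page's abstract tree, walked to root."""
    path = [random.randint(FIRST_LEAF, NUM_CACHES)]
    while path[-1] != 1:
        path.append(parent(path[-1]))
    return path

# Cache rule state: a counter per (page, node), cached copies per (cache, page).
counters: dict[tuple[str, int], int] = defaultdict(int)
cached: set[tuple[int, str]] = set()

def request(page: str) -> str:
    for rank in random_leaf_to_root_path(page):
        if rank == 1:
            return f"{page}: served by home server"     # root = server
        cache_id = h(page, rank)
        if (cache_id, page) in cached:
            return f"{page}: served by cache {cache_id}"
        counters[(page, rank)] += 1
        if counters[(page, rank)] >= THRESHOLD_Q:
            cached.add((cache_id, page))                # store a copy from now on
    raise AssertionError("path always ends at the root")

print(request("/hot/page.html"))
```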
3.2 Analysis

The analysis is broken into three parts. We begin by showing that the latency in processing a request is likely to be small, under the assumption that no server is swamped. We then show that no machine is likely to be swamped. We conclude by showing that no cache need store too many pages for the protocol to work properly.

3.2.1 Latency

Under our protocol, the delay a browser experiences in obtaining a page is determined by the height of the tree. If a request is forwarded from a leaf to the root, the latency is twice the length of the path, $2\log_d C$. If the request is satisfied with a cached copy, the latency is only smaller. If a request stops at a cache that is waiting for a cached copy, the latency is still smaller, since a request has already started up the tree. Note that $d$ can probably be made large in practice, so this latency will be quite small.

Note that in practice, the time required to obtain a large page is not multiplied by the number of steps in a path over which it travels. The reason is that the page can be transmitted along the path in a pipelined fashion. A cache in the middle of the path can start sending data to the next cache as soon as it receives some; it need not wait to receive the whole page. This means that although this protocol will increase the delay in getting small pages, the overhead for large pages is negligible. The existence of tree schemes, like the Harvest Cache, suggests that this is acceptable in practice.

Our bound is optimal (up to constant factors) for any protocol that forbids swamping. To see this, consider making $C$ requests for a single page. Look at the graph with nodes corresponding to machines and edges corresponding to links over which the page is sent. Small latency implies that this graph has small diameter, which implies that some node must have high degree, which implies swamping.

3.2.2 Swamping

The intuition behind our analysis is the following. First we analyze the number of requests directed to the abstract tree nodes of various pages. These give "weights" to the tree nodes. We then analyze the outcome when the tree nodes are mapped by a hash function onto the actual caching machines: a machine gets as many requests as the total weight of nodes mapped to it. To bound the projected weight, we first give a bound for the case where each node is assigned to a random machine. This is a weighted version of the familiar balls-in-bins type of analysis. Our analysis gives a bound with an exponential tail. We can therefore argue as in [11] that it applies even when the balls are assigned to bins only $k = O(\log N)$-way independently. This can be achieved by using a $k$-universal hash function to map the abstract tree nodes to machines.

The analysis of swamping runs much the same way, except that the "weights" on our abstract nodes are now the number of requests arriving at those nodes. As above, the number of requests that hit a machine is bounded by the weight of nodes mapped to it.

We will now analyze our protocol under the simplified model. In this "static" analysis we assume for now that caches have enough space that they never have to evict pages; this means that if a cache has already made $q$ requests for a page, it will not make another request for the same page. In Theorem 3.1 we provide high probability bounds on the number of requests a cache gets, assuming that all the outputs of the function $h$ are independent and random. Theorem 3.4 extends our high probability analysis to the case when $h$ is a $k$-way independent function. In particular we show that it suffices to have $k$ logarithmic in the system parameters to achieve the same high probability bounds as with full independence.

Analysis for Random h

Theorem 3.1. If $h$ is chosen uniformly at random from the space of functions $\mathcal{P} \times [1 \ldots C] \to \mathcal{C}$, then with probability at least $1 - 1/N$, where $N$ is a parameter, the number of requests a given cache gets is no more than

$$2\log_d C + O\!\left(\frac{\log N}{\log\log N}\right) + O\!\left(\frac{dq\log N}{\log(dq\log N)} + \log N\right)$$

Note that $\log_d C$ is the average number of requests per cache, since each browser request could give rise to $\log_d C$ requests up the trees. The $\frac{\log N}{\log\log N}$ term arises because, at the leaf nodes of some page's tree, a single cache could occur $\frac{\log N}{\log\log N}$ times (balls-in-bins), and the adversary could choose to devote all $R$ requests to that page. We prove the above theorem in the rest of the section.

We split the analysis into two parts. First we analyze the requests to a cache due to its presence in the leaf nodes of the abstract trees, then analyze the requests due to its presence at the internal nodes, and then add them up.

Requests to Leaf Nodes

Due to space limitations, we give a proof that only applies when $N \geq R$. Its extension to small $N$ is straightforward but long. Observe that the requests for each page are being mapped randomly onto the leaf nodes of its abstract tree, and then these leaf nodes are mapped randomly onto the set of caches. Look at the collection of all the leaf nodes and the number of requests (weight) associated with each one of them. The variance among the "weights" of the leaf nodes is maximized when all the requests are made for one page. This is also the case which maximizes the number of leaf node requests on a cache.

Each page's tree has about $C(1 - 1/d)$ leaf nodes. Since a machine $m$ has a $1/C$ chance of occurring at a particular leaf node, with probability $1 - 1/N$ it will occur in $O(\frac{\log N}{\log\log N})$ leaf nodes. In fact, since there are at most $R$ requests, $m$ will occur $O(\frac{\log N}{\log\log N})$ times in all those requested pages' trees with probability $1 - R/N$. Given an assignment of machines to leaf nodes so that $m$ occurs $O(\frac{\log N}{\log\log N})$ times in each tree, the expected number of requests $m$ gets is $R \cdot \frac{1}{C} \cdot O(\frac{\log N}{\log\log N})$, which is $O(\frac{\log N}{\log\log N})$. Also, once the assignment of machines to leaf nodes is fixed, the number of requests $m$ gets is a sum of independent Bernoulli variables. So by Chernoff bounds, $m$ gets $O(\frac{\log N}{\log\log N} + \log N)$ requests with probability $1 - 1/N$. So we conclude that $m$ gets $O(\frac{\log N}{\log\log N} + \log N)$ with probability at least $1 - (R+1)/N$. Replacing $N$ by $N^2$ and assuming $N > R$, we can say that the same bound holds with probability $1 - 1/N$. It is easy to extend this proof so that the bound holds even for $N < R$.
Requests to Internal Nodes

Again we think of the protocol as first running on the abstract trees. Now no abstract internal node gets more than $dq$ requests, because each child node gives out at most $q$ requests for a page. Consider any arbitrary arrangement of paths for all the $R$ requests up their respective trees. Since there are only $R$ requests in all, we can bound the number of abstract nodes that get $dq$ requests. In fact we will bound the number of abstract nodes over all trees which receive between $2^j$ and $2^{j+1}$ requests, where $0 \leq j \leq \log(dq) - 1$. Let $n_j$ denote the number of abstract nodes that receive between $2^j$ and $2^{j+1}$ requests. Let $r_p$ be the number of requests for page $p$. Then $\sum_p r_p \leq R$. Since each of the $R$ requests gives rise to at most $\log_d C$ requests up the trees, the total number of requests is no more than $R\log_d C$. So,

$$\sum_{j=0}^{\log(dq)-1} 2^j n_j \leq R\log_d C \qquad (1)$$

Lemma 3.2. The total number of internal nodes which receive at least $qx$ requests is at most $2R/x$ if $x > 1$.

Proof (sketch): Look at the tree induced by the request paths, contract out degree-1 nodes, and count internal nodes.

For $x = 1$ there can clearly be no more than $R\log_d C$ requests. The preceding lemma tells us that $n_j$, the number of abstract nodes that receive between $2^j$ and $2^{j+1}$ requests, is at most $\frac{2R}{2^j}$, except for $j = 0$. For $j = 0$, $n_j$ will be at most $R\log_d C$. Now the probability that machine $m$ assumes a given one of these $n_j$ nodes is $1/C$. Since assignments of nodes to machines are independent, the probability that a machine $m$ receives more than $z$ of these nodes is at most $\binom{n_j}{z}(1/C)^z \leq (en_j/Cz)^z$. In order for the right hand side to be as small as $1/N$ we must have $z = \Theta\!\left(\frac{n_j}{C} + \frac{\log N}{\log(\frac{C}{n_j}\log N)}\right)$. Note that the latter term will be present only if $\frac{C}{n_j}\log N > 2$. So $z$ is $O\!\left(\frac{n_j}{C} + \frac{\log N}{\log(\frac{C}{n_j}\log N)}\right)$ with probability at least $1 - 1/N$.

So with probability at least $1 - \log(dq)/N$, the total number of requests received by $m$ due to internal nodes will be of the order of

$$\sum_{j=0}^{\log(dq)-1} 2^{j+1}\left(\frac{n_j}{C} + \frac{\log N}{\log(\frac{C}{n_j}\log N)}\right) = 2\log_d C + O\!\left(\frac{dq\log N}{\log(dq\log N)} + \log N\right)$$

By combining the high probability bounds for internal and leaf nodes, we can say that a machine gets

$$2\log_d C + O\!\left(\frac{\log N}{\log\log N}\right) + O\!\left(\frac{dq\log N}{\log(dq\log N)} + \log N\right)$$

requests with probability at least $1 - O(\log(dq)/N)$. Replacing $N$ by $N\log(dq)$ and ignoring $\log\log(dq)$ in comparison with $dq$, we get Theorem 3.1.

Tightness of the high probability bound

In this section we show that the high probability bound we have proven for the number of requests received by a machine $m$ is tight.

Lemma 3.3. There exists a distribution of $R$ requests to pages so that a given machine $m$ gets $\Omega\!\left(\log_d C + \frac{\log N}{\log\log N} + \frac{dq\log N}{\log(dq\log N)}\right)$ requests with probability at least $1/N$.

Proof: Full paper.

Analysis for k-way Independent h

We now extend our high probability analysis to functions $h$ that are chosen at random from a $k$-universal hash family.

Theorem 3.4. If $h$ is chosen at random from a $k$-universal hash family, then with probability at least $1 - 1/N$, a given cache receives no more than $2\log_d C + O\!\left((kqN(d+\rho))^{1/k}\left(1 + \frac{k}{\log k} + \log\frac{kdq}{dq+\rho}\right)\right)$ requests.

Proof: The full proof is deferred to the final version of the paper. This result does not follow immediately from the results of [11], but involves a similar argument.

Setting $k = \log N$ we get the following corollary.

Corollary 3.5. The high probability bound proved in Theorem 3.1 for the number of requests a cache gets holds even if $h$ is selected from a $\log N$-universal hash family.

In fact, this can be shown to be true for all the bounds that we will prove later, i.e., it suffices for $k$ to be logarithmic in the system size.

3.2.3 Storage

In this section, we discuss the amount of storage each cache must have in order to make our protocol work. The amount of storage required at a cache is simply the number of pages for which it receives more than $q$ requests.

Lemma 3.6. The total number of cached pages, over all machines, is $O(\log R\log_d C + R/q)$ with probability at least $1 - 1/R^{\Omega(1)}$. A given cache $m$ has $O(\rho/q + \log R)$ cached copies with high probability.

Proof (sketch): The analysis is very similar to that in the proof of Theorem 3.1. We again play the protocol on the abstract trees. Since a page is cached only if it is requested $q$ times, we assign each abstract node a weight of one if it gets more than $q$ requests and zero otherwise. These abstract nodes are then mapped randomly onto the set of caches. We can bound the total weight received by a particular cache, which is exactly the number of pages it caches.

4 Consistent Hashing

In this section we define a new hashing technique called consistent hashing. We motivate this technique by reference to a simple scheme for data replication on the Internet. Consider a single server that has a large number of objects that other clients might want to access. It is natural to introduce a layer of caches between the clients and the server in order to reduce the load on the server. In such a scheme, the objects should be distributed across the caches, so that each is responsible for a roughly equal share. In addition, clients need to know which cache to query for a specific object. The obvious approach is hashing. The server can use a hash function that evenly distributes the objects across the caches. Clients can use the hash function to discover which cache stores an object.
Consider now what happens when the set of active caching machines changes, or when each client is aware of a different set of caches. (Such situations are very plausible on the Internet.) If the distribution was done with a classical hash function (for example, the linear congruential function $x \mapsto ax + b \pmod{p}$), such inconsistencies would be catastrophic. When the range of the hash function ($p$ in the example) changed, almost every item would be hashed to a new location. Suddenly, all cached data is useless because clients are looking for it in a different location.
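The scale of this disruption is easy to check empirically. The sketch below (ours, not the paper's) hashes items into $p$ buckets by reducing mod $p$ and counts how many items keep their bucket when $p$ grows by one; almost none do:

```python
# Illustration (not from the paper): behavior of a mod-p hash when the
# range grows from p to p + 1 buckets. Almost every item is remapped.
p = 100
items = range(10_000)
unmoved = sum(1 for x in items if x % p == x % (p + 1))
print(f"{unmoved / 10_000:.1%} of items stay put")  # prints 1.0% here
```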
Consistent hashing solves this problem of different "views." We define a view to be the set of caches of which a particular client is aware. We assume that while views can be inconsistent, they are substantial: each machine is aware of a constant fraction of the currently operating caches. A client uses a consistent hash function to map an object to one of the caches in its view. We analyze and construct hash functions with the following consistency properties. First, there is a "smoothness" property. When a machine is added to or removed from the set of caches, the expected fraction of objects that must be moved to a new cache is the minimum needed to maintain a balanced load across the caches. Second, over all the client views, the total number of different caches to which an object is assigned is small. We call this property "spread". Similarly, over all the client views, the number of distinct objects assigned to a particular cache is small. We call this property "load".

Consistent hashing therefore solves the problems discussed above. The "spread" property implies that even in the presence of inconsistent views of the world, references for a given object are directed only to a small number of caching machines. Distributing an object to this small set of caches will insure access for all clients, without using a lot of storage. The "load" property implies that no one cache is assigned an unreasonable number of objects. The "smoothness" property implies that smooth changes in the set of caching machines are matched by a smooth evolution in the location of cached objects.

Since there are many ways to formalize the notion of consistency as described above, we will not commit to a precise definition. Rather, in Section 4.1 we define a "ranged hash function" and then precisely define several quantities that capture different aspects of "consistency". In Section 4.2 we construct practical hash functions which exhibit all four to some extent. In Section 4.4, we discuss other aspects of consistent hashing which, though not germane to this paper, indicate some of the richness underlying the theory.

4.1 Definitions

In this section, we formalize and relate four notions of consistency. Let $\mathcal{I}$ be the set of items and $\mathcal{B}$ be the set of buckets. Let $I = |\mathcal{I}|$ be the number of items. A view is any subset $\mathcal{V}$ of the buckets $\mathcal{B}$. A ranged hash function is a function of the form $f : 2^{\mathcal{B}} \times \mathcal{I} \to \mathcal{B}$. Such a function specifies an assignment of items to buckets for every possible view. That is, $f(\mathcal{V}, i)$ is the bucket to which item $i$ is assigned in view $\mathcal{V}$. (We will use the notation $f_{\mathcal{V}}(i)$ in place of $f(\mathcal{V}, i)$ from now on.) Since items should only be assigned to usable buckets, we require $f_{\mathcal{V}}(\mathcal{I}) \subseteq \mathcal{V}$ for every view $\mathcal{V}$.

A ranged hash family is a family of ranged hash functions. A random ranged hash function is a function drawn at random from a particular ranged hash family.

In the remainder of this section, we state and relate some reasonable notions of consistency regarding ranged hash families. Throughout, we use the following notational conventions: $\mathcal{F}$ is a ranged hash family, $f$ is a ranged hash function, $\mathcal{V}$ is a view, $i$ is an item, and $b$ is a bucket.

Balance: A ranged hash family is balanced if, given a particular view $\mathcal{V}$, a set of items, and a randomly chosen function selected from the hash family, with high probability the fraction of items mapped to each bucket is $O(1/|\mathcal{V}|)$.

The balance property is what is prized about standard hash functions: they distribute items among buckets in a balanced fashion.

Monotonicity: A ranged hash function $f$ is monotone if for all views $\mathcal{V}_1 \subseteq \mathcal{V}_2 \subseteq \mathcal{B}$, $f_{\mathcal{V}_2}(i) \in \mathcal{V}_1$ implies $f_{\mathcal{V}_1}(i) = f_{\mathcal{V}_2}(i)$. A ranged hash family is monotone if every ranged hash function in it is.

This property says that if items are initially assigned to a set of buckets $\mathcal{V}_1$ and then some new buckets are added to form $\mathcal{V}_2$, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. This reflects one intuition about consistency: when the set of usable buckets changes, items should only move if necessary to preserve an even distribution.

Spread: Let $\mathcal{V}_1, \ldots, \mathcal{V}_V$ be a set of views, altogether containing $C$ distinct buckets and each individually containing at least $C/t$ buckets. For a ranged hash function and a particular item $i$, the spread $\sigma(i)$ is the quantity $|\{f_{\mathcal{V}_j}(i)\}_{j=1}^{V}|$. The spread of a hash function, $\sigma(f)$, is the maximum spread of an item. The spread of a hash family is $\sigma$ if, with high probability, the spread of a random hash function from the family is $\sigma$.

The idea behind spread is that there are $V$ people, each of whom can see at least a constant fraction ($1/t$) of the buckets that are visible to anyone. Each person tries to assign an item $i$ to a bucket using a consistent hash function. The property says that across the entire group, there are at most $\sigma(i)$ different opinions about which bucket should contain the item. Clearly, a good consistent hash function should have low spread over all items.

Load: Define a set of $V$ views as before. For a ranged hash function $f$ and bucket $b$, the load $\lambda(b)$ is the quantity $|\bigcup_{\mathcal{V}} f_{\mathcal{V}}^{-1}(b)|$. The load of a hash function is the maximum load of a bucket. The load of a hash family is $\lambda$ if, with high probability, a randomly chosen hash function has load $\lambda$. (Note that $f_{\mathcal{V}}^{-1}(b)$ is the set of items assigned to bucket $b$ in view $\mathcal{V}$.)

The load property is similar to spread. The same $V$ people are back, but this time we consider a particular bucket $b$ instead of an item. The property says that there are at most $\lambda(b)$ distinct items that at least one person thinks belong in the bucket. A good consistent hash function should also have low load.

Our main result for consistent hashing is Theorem 4.1, which shows the existence of an efficiently computable monotone ranged hash family with logarithmic spread and balance.

4.2 Construction

We now give a construction of a ranged hash family with good properties. Suppose that we have two random functions $r_B$ and $r_I$. The function $r_B$ maps buckets randomly to the unit interval, and $r_I$ does the same for items. $f_{\mathcal{V}}(i)$ is defined to be the bucket $b \in \mathcal{V}$ that minimizes $|r_B(b) - r_I(i)|$. In other words, $i$ is mapped to the bucket "closest" to $i$. For reasons that will become apparent, we actually need to have more than one point in the unit interval associated with each bucket. Assuming that the number of buckets in the range is always less than $C$, we will need $\kappa\log(C)$ points for each bucket for some constant $\kappa$. The easiest way to view this is that each bucket is replicated $\kappa\log(C)$ times, and then $r_B$ maps each replicated bucket randomly. In order to economize on the space to represent a function in the family, and on the use of random bits, we only demand that the functions $r_B$ and $r_I$ map points $\kappa\log(C)$-way independently and uniformly to $[0, 1]$. Note that for each point we pick in the unit interval, we need only pick enough random bits to distinguish the point from all other points. Thus it is unlikely that we need more than $\log(\text{number of points})$ bits for each point.

Denote the above described hash family as $\mathcal{F}$.

Theorem 4.1. The ranged hash family $\mathcal{F}$ described above has the following properties:

1. $\mathcal{F}$ is monotone.

2. Balance: For a fixed view $\mathcal{V}$, $\Pr[f_{\mathcal{V}}(i) = b] \leq \frac{O(1)}{|\mathcal{V}|}$ for $i \in \mathcal{I}$ and $b \in \mathcal{V}$, and, conditioned on the choice of $r_B$, the assignments of items to buckets are $\kappa\log(C)$-way independent.

3. Spread: If the number of views $V = \rho C$ for some constant $\rho$, and the number of items $I = C$, then for $i \in \mathcal{I}$, $\sigma(i)$ is $O(t\log(C))$ with probability greater than $1 - 1/C^{\Omega(1)}$.

4. Load: If $V$ and $I$ are as above, then for $b \in \mathcal{B}$, $\lambda(b)$ is $O(t\log(C))$ with probability greater than $1 - 1/C^{\Omega(1)}$.
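Before the proof sketch, here is a small illustration of the construction, written by us and not taken from the paper: each bucket gets $\kappa\log C$ points on the unit interval, and an item goes to the bucket with the nearest point. A view is modeled simply by building the structure over the buckets a client can see. Python's hashlib stands in for the limited-independence functions $r_B$ and $r_I$, and KAPPA is an assumed small constant:

```python
# Sketch of the Section 4.2 construction (illustrative only).
import hashlib
import math
from bisect import bisect_left, insort

KAPPA = 2  # assumed constant "kappa" in kappa * log C

def hash_point(label: str) -> float:
    """Deterministic stand-in for a random map into the unit interval."""
    d = hashlib.sha256(label.encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

class ConsistentHash:
    def __init__(self, buckets: list[str], c_upper_bound: int):
        self.points: list[tuple[float, str]] = []  # sorted (position, bucket)
        self.copies = max(1, KAPPA * math.ceil(math.log2(c_upper_bound)))
        for b in buckets:
            self.add_bucket(b)

    def add_bucket(self, b: str) -> None:
        # Each bucket is replicated `copies` times on the unit interval.
        for r in range(self.copies):
            insort(self.points, (hash_point(f"{b}#{r}"), b))

    def lookup(self, item: str) -> str:
        """Assign item to the bucket whose point minimizes |r_B(b) - r_I(i)|."""
        x = hash_point(item)
        j = bisect_left(self.points, (x, ""))
        candidates = []
        if j > 0:
            candidates.append(self.points[j - 1])   # nearest point to the left
        if j < len(self.points):
            candidates.append(self.points[j])       # nearest point to the right
        _, bucket = min(candidates, key=lambda p: abs(p[0] - x))
        return bucket

ring = ConsistentHash(["cacheA", "cacheB", "cacheC"], c_upper_bound=1024)
print(ring.lookup("/hot/page.html"))
ring.add_bucket("cacheD")  # only items now nearest to cacheD's points move
```

Note how adding a bucket only inserts new points: an item's assignment changes only if one of the new points is now closest to it, which is exactly the monotonicity argument that opens the proof below.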
Proof (sketch): Monotonicity is immediate. When a new bucket is added, the only items that move are those that are now closest to one of the new bucket's associated points. No items move between old buckets. The spread and load properties follow from the observation that, with high probability, a point from every view falls into an interval of length $O(t/C)$. Spread follows by observing that the number of bucket points that fall in this size interval around an item point is an upper bound on the spread of that item, since no other bucket can be closer in any view. Standard Chernoff arguments apply to this case. Load follows by a similar argument where we count the number of item points that fall in the region "owned" by a bucket's associated points. Balance follows from the fact that when the $\kappa|\mathcal{V}|\log(C)$ points of a view are randomly mapped to the unit interval, each bucket is with high probability responsible for no more than an $O(1/|\mathcal{V}|)$ fraction of the interval. The key here is to count the number of combinatorially distinct ways of assigning this large a fraction to the $\kappa\log(C)$ points associated with a bucket. This turns out to be polynomial in $C$. We then argue that with high probability none of these possibilities could actually occur, by showing that in each one an additional bucket point is likely to fall. We deduce that the actual length must be smaller than $O(1/|\mathcal{V}|)$. All of the above proofs can be done with only $\kappa\log(C)$-way independent mappings.

The following corollary is immediate and is useful in the rest of the paper.

Corollary 4.2. With the same conditions of the previous theorem, $\Pr[f_{\mathcal{V}}(i) = b \text{ in any view}] \leq \frac{O(t\log(C))}{|\mathcal{V}|}$ for $i \in \mathcal{I}$ and $b \in \mathcal{B}$.

4.3 Implementation

In this section we show how the hash family just described can be implemented efficiently. Specifically, the expected running time for a single hash computation will be $O(1)$. The expectation is over the choice of hash function. The expected running time for adding or deleting a bucket will be $O(\log C)$, where $C$ is an upper bound on the total number of buckets in all views.

A simple implementation uses a balanced binary search tree to store the correspondence between segments of the unit interval and buckets. If there are $C$ buckets, then there will be $\kappa C\log(C)$ intervals, so the search tree will have depth $O(\log(C))$. Thus, a single hash computation takes $O(\log(C))$ time. The time for an addition or removal of a bucket is $O(\log^2(C))$, since we insert or delete $\kappa\log(C)$ points for each bucket.

The following trick reduces the expected running time of a hash computation to $O(1)$. The idea is to divide the interval into roughly $\kappa C\log(C)$ equal-length segments, and to keep a separate search tree for each segment. Thus, the time to compute the hash function is the time to determine which interval $r_I(i)$ is in, plus the time to look up the bucket in the corresponding search tree. The first time is always $O(1)$. Since the expected number of points in each segment is $O(1)$, the second time is $O(1)$ in expectation.

One caveat to the above is that as the number of buckets grows, the size of the subintervals needs to shrink. In order to deal with this issue, we will use intervals only of length $1/2^x$ for some $x$. At first we choose the largest $x$ such that $1/2^x \leq 1/(\kappa C\log(C))$. Then, as points are added, we bisect segments gradually so that when we reach the next power of 2, we have already divided all the segments. In this way we amortize the work of dividing search trees over all of the additions and removals. Another point is that the search trees in adjacent empty intervals may all need to be updated when a bucket is added, since they may all now be closest to that bucket. Since the expected length of a run of empty intervals is small, the additional cost is negligible. For a more complete analysis of the running time we refer to the complete version of the paper.

4.4 Some Theorems on Consistent Hashing

In this section, we discuss some additional features of consistent hashing which, though unnecessary for the remainder of the paper, demonstrate some of its interesting properties.

To give insight into the monotone property, we will define a new class of hash functions and then show that this is equivalent to the class of monotone ranged hash functions.

A $\pi$-hash function is a hash function of the familiar form $f : 2^{\mathcal{B}} \times \mathcal{I} \to \mathcal{B}$ constructed as follows. With each item $i \in \mathcal{I}$, associate a permutation $\pi(i)$ of all the buckets $\mathcal{B}$. Define $f_{\mathcal{V}}(i)$ to be the first bucket in the permutation $\pi(i)$ that is contained in the view $\mathcal{V}$. Note that the permutations need not be chosen uniformly or independently.

Theorem 4.3. Every monotone ranged hash function is a $\pi$-hash function and vice versa.

Proof (sketch): For a ranged hash function $f$, associate item $i$ with the permutation $b_1 = f_{\mathcal{B}}(i), \ldots, b_{j+1} = f_{\mathcal{B} - \{b_1 \ldots b_j\}}(i), \ldots$. Suppose $b_j$ is the first element of an arbitrary view $\mathcal{V}_1$ in this permutation. Then $\mathcal{V}_1 \subseteq \mathcal{V}_2 = \mathcal{B} - \{b_1 \ldots b_{j-1}\}$. Since $f_{\mathcal{V}_2}(i) = b_j$, monotonicity implies $f_{\mathcal{V}_1}(i) = b_j$.

The equivalence stated in Theorem 4.3 allows us to reason about monotone ranged hash functions in terms of permutations associated with items.

Universality: A ranged hash family is universal if restricting every function in the family to a single view creates a universal hash family.

This property is one way of requiring that a ranged hash function be well-behaved in every view. The above condition is rather stringent; it says that if a view is fixed, items are assigned randomly to the bins in that view. This implies that in any view $\mathcal{V}$, the expected fraction of items assigned to $j$ of the buckets is $j/|\mathcal{V}|$. Using only monotonicity and this fact about the uniformity of the assignment, we can determine the expected number of items reassigned when the set of usable buckets changes. This relates to the informal notion of "smoothness".

Theorem 4.4. Let $f$ be a monotone, universal ranged hash function. Let $\mathcal{V}_1$ and $\mathcal{V}_2$ be views. The expected fraction of items $i$ for which $f_{\mathcal{V}_1}(i) = f_{\mathcal{V}_2}(i)$ is $\frac{|\mathcal{V}_1 \cap \mathcal{V}_2|}{|\mathcal{V}_1 \cup \mathcal{V}_2|}$.

Proof (sketch): Count the number of items that move as we add buckets to $\mathcal{V}_1$ until the view is $\mathcal{V}_1 \cup \mathcal{V}_2$, and then delete buckets down to $\mathcal{V}_2$.

Note that monotonicity is used only to show an upper bound on the number of items reassigned to a new bucket; this implies that one cannot obtain a "more consistent" universal hash function by relaxing the monotone condition.

We have shown that every monotone ranged hash function can be obtained by associating each item with a permutation of buckets.
The most natural monotone consistent hash function is obtained by choosing these permutations independently and uniformly at random. We denote this function by $\tilde{f}$.

Theorem 4.5. The function $\tilde{f}$ is monotone and universal. For item $i$ and bucket $b$, each of the following holds with probability at least $1 - 1/N$: $\sigma(i) \leq t\log(NV)$ and $\lambda(b) \leq \left(1 + \sqrt{\frac{4tC}{I}}\right)\frac{tI\log(2NVI)}{C}$.

Proof: Monotonicity and universality are immediate; this leaves spread and load. Define:

$$\sigma = t\log(NV), \qquad \lambda = \left(1 + \sqrt{\frac{4tC}{I}}\right)\frac{tI\log(2NVI)}{C}$$

We use $\pi'(i)$ to denote a list of the buckets in $\mathcal{V}_1 \cup \ldots \cup \mathcal{V}_V$ which are ordered as in $\pi(i)$.

First, consider spread. Recall that in a particular view, item $i$ is assigned to the first bucket in $\pi(i)$ which is also in the view. Therefore, if every view contains one of the first $\sigma$ buckets in $\pi'(i)$, then in every view item $i$ will be assigned to one of the first $\sigma$ buckets in $\pi'(i)$. This implies that item $i$ is assigned to at most $\sigma$ distinct buckets over all the views.

We have to show that with high probability every view contains one of the first $\sigma$ buckets in $\pi'(i)$. We do this by showing that the complement has low probability; that is, the probability that some view contains none of the first $\sigma$ buckets is at most $1/N$.

The probability that a particular view does not contain the first bucket in $\pi'(i)$ is at most $1 - 1/t$, since each view contains at least a $1/t$ fraction of all buckets. The fact that the first bucket is not in a view only reduces the probability that subsequent buckets are not in the view. Therefore, the probability that a particular view contains none of the first $\sigma$ buckets is at most $(1 - 1/t)^{\sigma} = (1 - 1/t)^{t\log(NV)} < 1/(NV)$. By the union bound, the probability that even one of the $V$ views contains none of the first $\sigma$ buckets is at most $1/N$.

Now consider load. By similar reasoning, every item $i$ in every view is assigned to one of the first $t\log(2NVI)$ buckets in $\pi'(i)$ with probability at least $1 - 1/(2N)$. We show below that a fixed bucket $b$ appears among the first $t\log(2NVI)$ buckets in $\pi'(i)$ for at most $\lambda$ items $i$ with probability at least $1 - 1/(2N)$. By the union bound, both events occur with high probability. This implies that at most $\lambda$ items are assigned to bucket $b$ over all the views.

All that remains is to prove the second statement. The expected number of items $i$ for which the bucket $b$ appears among the first $t\log(2NVI)$ buckets in $\pi'(i)$ is $tI\log(2NVI)/C$. Using Chernoff bounds, we find that bucket $b$ appears among the first $t\log(2NVI)$ buckets in $\pi'(i)$ for at most $\lambda$ items $i$ with probability at least $1 - 1/(2NVI) \geq 1 - 1/(2N)$.

A simple approach to constructing a consistent hash function is to assign random scores to buckets, independently for each item. Sorting the scores defines a random permutation, and therefore has the good properties proved in this section. However, finding the bucket an item belongs in requires computing all the scores. This could be restrictively slow for large bucket sets.
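This random-scores scheme (essentially what is now called highest-random-weight or rendezvous hashing, a name the paper does not use) is easy to state in code. The sketch below is ours, and it makes the per-lookup cost the text warns about visible:

```python
# Illustration of the "random scores" construction described above (our
# sketch, not the paper's). Each (item, bucket) pair gets a pseudorandom
# score; an item goes to the highest-scoring bucket in the client's view.
import hashlib

def score(item: str, bucket: str) -> int:
    return int.from_bytes(
        hashlib.sha256(f"{item}|{bucket}".encode()).digest()[:8], "big")

def assign(item: str, view: list[str]) -> str:
    # Every lookup scores every bucket in the view -- the cost the text
    # warns could be restrictively slow for large bucket sets.
    return max(view, key=lambda b: score(item, b))

v1 = ["cacheA", "cacheB", "cacheC", "cacheD"]
v2 = ["cacheB", "cacheC", "cacheD", "cacheE"]  # a different client's view
item = "/hot/page.html"
print(assign(item, v1), assign(item, v2))  # usually agree: monotone by design
```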
5 Random Trees in an Inconsistent World

In this section we apply the techniques developed in the last section to the simple hot spot protocol developed in Section 3. We now relax the assumption that clients know about all of the caches. We assume only that each machine knows about a $1/t$ fraction of the caches, chosen by an adversary. There is no difference in the protocol, except that the mapping $h$ is a consistent hash function. This change will not affect latency. Therefore, we only analyze the effects on swamping and storage. The basic properties of consistent hashing are crucial in showing that the protocol still works well. In particular, the blowup in the number of requests and storage is proportional to the maximum spread $\sigma$ and load $\lambda$ of the hash function.

5.1 Swamping

Theorem 5.1. If $h$ is implemented using the $\kappa\log(C)$-way independent consistent hash function of Theorem 4.1, and if each view consists of $C' = C/t$ caches, then with probability at least $1 - 1/C^{\Omega(1)}$ an arbitrary cache gets no more than $O\!\left(2t^2\log_d C'\log C + (dqt\log C + t)\log C\right)$ requests.

Proof (sketch): We look at the different trees of caches for different views for one page, $p$. Let $C' = C/t$ denote the number of caches in each tree. We overlay these different trees on one another to get a new tree where in each node there is a set of caches. Due to the spread property of the consistent hash function, at most $\sigma = O(t\log C)$ caches appear at any node in this combined tree with high probability. In fact, since there are only $R$ requests, this will be true for the nodes of all the $R$ trees for the requested pages. If $E_{p,j}$ denotes the event that $m$ appears in the $j$th node of the combined tree for page $p$, then we know from Corollary 4.2 that the probability of this event is $O(\lambda/C)$, where $\lambda$ is the load, which is $O(t\log C)$ with high probability. We condition on the event that $\sigma$ and $\lambda$ are $O(t\log C)$, which happens with high probability.

Since a cache in a node sends out at most $q$ requests, each node in the combined tree sends out at most $\sigma q$ requests. We now adapt the proof of Theorem 3.1 to this case. In Theorem 3.1, where every machine was aware of all the $C$ caches, an abstract node was assigned to any given machine with probability $1/C$. We now assign an abstract node to a given machine with probability $O(\lambda/C)$. So we have a scenario with $C' = C/t$ caches where each abstract node sends out up to $q' = \sigma q$ requests to its parent and $m$ occurs at each abstract node independently and with probability $O(\lambda/C)$. The rest of the proof is very similar to that of Theorem 3.1.

5.2 Storage

Using techniques similar to those in the proof of Theorem 5.1 we get the following lemma. The proof is deferred to the final version of the paper.

Lemma 5.2. The total number of cached pages, over all machines, is $O(\sigma(\log R\log_d C + R/q))$ with probability $1 - 1/R^{\Omega(1)}$. A given cache $m$ has $O(\sigma(\rho/q + \log R))$ cached copies with high probability.
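To make the spread property that drives Theorem 5.1 concrete, the following self-contained sketch (ours) gives many clients random partial views of the caches ($t = 2$) and maps one item with the same consistent hash function; the number of distinct answers stays small:

```python
# Empirical look at "spread" (illustration only): 1000 clients, each seeing
# a random half of 100 caches, map one item; few distinct caches are chosen.
import hashlib
import random

CACHES = [f"cache{i}" for i in range(100)]
COPIES = 14  # roughly kappa * log2(C) replicated points per cache

def pt(label: str) -> float:
    return int.from_bytes(
        hashlib.sha256(label.encode()).digest()[:8], "big") / 2**64

POINTS = sorted((pt(f"{c}#{r}"), c) for c in CACHES for r in range(COPIES))
ITEM_POS = pt("/hot/page.html")

def lookup(view: set) -> str:
    # Closest visible point to the item, as in the Section 4.2 construction.
    visible = [(p, c) for (p, c) in POINTS if c in view]
    return min(visible, key=lambda pc: abs(pc[0] - ITEM_POS))[1]

answers = {lookup(set(random.sample(CACHES, 50))) for _ in range(1000)}
print(f"distinct caches chosen across 1000 views: {len(answers)}")
```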
The ultrametric is a natural model of Internet distances, since it tocol, mi plays the protocol on a set of at least i machines. So m is
essentially captures the hierarchical nature of the Internet topology, on the path of the request from mi with probability O((logd i)=i).
under which, for example, all machines in a given university are Summing over i, the expected load on m is O(log C ).
equidistant, but all of them are farther away from another univer- Stating things slightly more formally, we consider a set of log C
sity, and still farther from another continent. The logical point-to- C f g
nested “virtual” clusters i = m1 ; : : : ; m2i . Note that any
point connectivity is established atop a physical network, and it is C
browser in i+1 C C
i will use all machines in i in the protocol. We
generally the case that the latency between two sites is determined modify the protocol so that such a machine uses only the machines
by the “highest level” physical communication link that must be C
in i . This only reduces the number of machines it uses. According
traversed on the path between them. Indeed, another definition of to the monotonicity property of our consistent hash functions, this
an ultrametric is as a hierarchical clustering of points. The distance only increases the load on machine m.
in the ultrametric between two points is completely determined by C
Now we can consider each i separately and apply the static
the smallest cluster containing both of the points. analysis. The total number of requests arriving in one of the clusters
under the modified protocol is proportional to the number of caches
6.1 Protocol in the cluster, so our static analysis applies to the cluster. This
gives us a bound of O(log d C ) on the load induced on m by i . C
The only modification we make to the protocol is that when a Sumnming over the log C clusters proves the theorem.
browser maps the tree nodes to caches, it only uses caches that
are as close to it as the server of the desired page. By doing this,
we insure that our path to the server does not contain any caches 6.2.2 Storage
that are unnecessarily far away in the metric. The mapping is done Using techniques similar to those in proof of Theorem 6.1 we get
using a consistent hash function, which is the vital element of the the following lemma.
solution.
Clearly, requiring that browsers use "nearby" caches can cause swamping if there is only one cache and server near many browsers. Thus, in order to rule out degenerate ultrametrics in which some browsers are not close to any cache, or in which some clusters contain no caches at all, we restrict the set of ultrametrics that may be presented to the protocol. The restriction is that in any cluster the ratio of the number of caches to the number of browsers may not fall below $1/\rho$ (recall that $R = \rho C$). This restriction makes sense in the real world, where caches are likely to be spread evenly over the Internet. It is also necessary: we can prove that a large number of browsers clustered around one cache can be forced to swamp that cache in some circumstances.

6.2 Analysis

It is clear from the protocol and the definition of an ultrametric that the latency will be no more than the depth of the tree, $\log_d C$, times the latency between the browser and the server. So once again we need only look at swamping and storage. The intuition is that inside each cluster the bounds we proved for the unit-distance model apply. The monotone property of consistent hashing allows us to restrict our analysis to $\log C$ clusters; summing over these clusters, we incur only a $\log C$ blowup in the bound.

6.2.1 Swamping

Theorem 6.1 Let $\delta$ be an ultrametric. Suppose that each browser makes at most one request. Then in the protocol above, an arbitrary cache gets no more than
$$\log C \left( 8\rho\log_d C + O\!\left(\frac{\log N}{\log\log N}\right) + O\!\left(\frac{dq\log N}{\log(dq\log N)} + \rho\log N\right) \right)$$
requests with probability at least $1 - 1/N$, where $N$ is a parameter.

Proof (sketch): The intuition behind our proof is the following. We bound the load on a machine $m$. Consider the ranking of machines $m_1, m_2, \ldots$ according to their distance from $m$. Suppose $m_i$ asks for a page from a machine closer to itself than $m$. Then, according to our modified protocol, it will never involve $m$ in the request. So we need only consider machine $m_i$ if it asks for a page at least as far away from itself as $m$. It follows from the definition of an ultrametric that every $m_j$, $j \le i$, is also used in the revised protocol by $m_i$.

Intuitively, our original protocol spread load among the machines so that the probability that a machine got onto the path for a particular page request was $O((\log_d C)/C)$. In our ultrametric protocol, $m_i$ plays the protocol on a set of at least $i$ machines. So $m$ is on the path of the request from $m_i$ with probability $O((\log_d i)/i)$. Summing over $i$, the expected load on $m$ is $O(\log C \log_d C)$.

Stating things slightly more formally, we consider a set of $\log C$ nested "virtual" clusters $C_i = \{m_1, \ldots, m_{2^i}\}$. Note that any browser in $C_{i+1} \setminus C_i$ will use all machines in $C_i$ in the protocol. We modify the protocol so that such a machine uses only the machines in $C_i$. This only reduces the number of machines it uses; by the monotonicity property of our consistent hash functions, this only increases the load on machine $m$.

Now we can consider each $C_i$ separately and apply the static analysis. The total number of requests arriving in one of the clusters under the modified protocol is proportional to the number of caches in the cluster, so our static analysis applies to the cluster. This gives us a bound of $O(\rho\log_d C)$ on the load induced on $m$ by $C_i$. Summing over the $\log C$ clusters proves the theorem.

6.2.2 Storage

Using techniques similar to those in the proof of Theorem 6.1, we get the following lemma.

Lemma 6.2 The total number of cached pages, over all machines, is $O((R/q)\log R\,\log_d C\,\log C)$ with probability $1 - 1/R^{\Omega(1)}$. A given cache $m$ has $O(\log C\,(\rho/q + \log R))$ cached copies with high probability.

7 Fault Tolerance

Basically, as in Plaxton/Rajaraman, the fact that our protocol uses random short paths to the server makes it fault tolerant. We consider a model in which an adversary designates that some of the caching machines are down, that is, ignore all attempts at communication. Remember that our adversary does not get to see our random bits, and thus cannot simply designate all machines at the top of a tree to be down. The only restriction is that a specified fraction $s$ of the machines in every view must be up. Under our protocol, no preemptive caching of pages is done. Thus, if a server goes down, all pages that it has not distributed become inaccessible to any algorithm. This problem can be eliminated using standard techniques, such as Rabin's Information Dispersal Algorithm [10]. So we ignore server faults.

Observe that a request is satisfied if and only if none of the caches serving the nodes of the tree path is down. Since each node is mapped to a machine ($k$-wise) independently, it is trivial (using standard Chernoff bounds) to lower bound the number of abstract nodes that have working paths to the root. This leads to the following lemma:

Lemma 7.1 Suppose that $d = \Omega(\log N)$. With high probability, the fraction of abstract-tree leaves that have a working path to the root is $\Omega(s^{\log_d C})$. In particular, if $s = 1 - O(1/\log_d C)$, this fraction is a constant.

The modification to the protocol is therefore quite simple. Choose a parameter $t$, and simultaneously send $t$ requests for the page. A logarithmic number of requests is sufficient to give a high probability that one of the requests goes through. This change in the protocol will of course have an impact on the system; the impact is described in the full paper.
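A sketch of this redundant-request modification, under assumptions of ours: `fetch(page, path)` is a callable that raises `OSError` when it runs into a down machine, `random_paths(t)` yields $t$ independently chosen tree paths, and $t$ would be taken logarithmic as the lemma suggests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def robust_request(page, random_paths, fetch, t):
    """Send t requests for the page at once, each down an independently
    chosen tree path; return as soon as any one of them gets through."""
    with ThreadPoolExecutor(max_workers=t) as pool:
        pending = [pool.submit(fetch, page, path) for path in random_paths(t)]
        for done in as_completed(pending):
            try:
                return done.result()
            except OSError:
                continue  # that path hit a down machine; wait for the rest
    raise RuntimeError("all %d attempts failed" % t)
```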
Note that since communication is a chancy thing on the Internet, failure to get a quick response from a machine is not a particularly good indication that it is down. Thus, we have focused on the tolerance of faults, and not on their detection. However, given some way to decide that a machine is down, our consistent hash functions make it trivial to reassign the work to other machines: if you decide that a machine is down, remove it from your view.
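This reassignment is exactly the behavior consistent hashing provides. A minimal sketch, reusing the `point` and `circle_dist` helpers from the Section 6.1 sketch (machine and page names here are made up):

```python
def owner(page, view):
    """The machine responsible for a page under the current view."""
    p = point(page)
    return min(view, key=lambda m: circle_dist(p, point(m)))

view = {"cacheA", "cacheB", "cacheC", "cacheD"}
pages = ["page%d" % i for i in range(100)]
before = {pg: owner(pg, view) for pg in pages}

view.discard("cacheB")  # we have decided cacheB is down
after = {pg: owner(pg, view) for pg in pages}

# Only the pages cacheB owned move; every other assignment is stable.
assert all(after[pg] == before[pg] for pg in pages if before[pg] != "cacheB")
```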
8 Adding Time to the Model

So far, we have omitted any real mention of time from our analyses. We have instead considered and analyzed a single "batch" of $R$ requests, and argued that this batch causes a limited amount of caching (storage usage) at every machine, while simultaneously arguing that no machine gets swamped by the batch. In this section, we show how this static analysis carries implications for a temporal model in which requests arrive over time. Recall that our temporal model says that browsers issue requests at a certain rate $\lambda$.

Time is a problematic issue when modeling the Internet, because its communication protocols have no guarantees regarding time of delivery. Thus any one request could take arbitrarily long. However, we can consider the rate at which servers receive requests. This seems like an overly simplistic measure, but the rate at which a machine can receive requests is in fact the statistic that hardware manufacturers advertise. We consider an interval of time $\tau$, and apply our "requests all come at once" analysis to the requests that arrive in this interval.

We can write the bounds from the static analysis on $R$ requests as follows:
$$\text{cache size} = a_s R + b_s \qquad\qquad \text{cache load} = a_l R + b_l$$

Suppose machines have cache size $m$. Consider a time interval $\tau$ small enough to make $R = \lambda\tau$ small enough that $m > a_s R + b_s$. In other words, the number of requests that arrive in this interval is insufficient, according to our static analysis, to use storage exceeding $m$ per machine. Thus once a machine caches a page during this interval, it keeps it for the remainder of the interval, and our static analysis applies over the interval. This gives us a bound on how many requests can arrive in the interval. Dividing by the interval length, we get the rate at which caches see requests: $\lambda\left(a_l + \frac{b_l a_s}{m - b_s}\right)$.
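As a worked check of this derivation, here is the computation with made-up coefficients (none of these numbers come from the analysis; they only exercise the algebra):

```python
a_s, b_s = 0.01, 5.0   # cache size = a_s * R + b_s   (pages)
a_l, b_l = 0.02, 3.0   # cache load = a_l * R + b_l   (requests)
lam = 1000.0           # aggregate request rate (requests / second)
m = 50.0               # per-machine cache size (pages)

R = (m - b_s) / a_s            # largest batch the storage bound allows
tau = R / lam                  # interval during which that batch arrives
rate = (a_l * R + b_l) / tau   # per-cache request rate over the interval

# Agrees with the closed form lam * (a_l + b_l * a_s / (m - b_s)).
assert abs(rate - lam * (a_l + b_l * a_s / (m - b_s))) < 1e-9
print(rate)  # about 20.67 requests per second per cache
```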
Plugging in the bounds from Section 3, we get the following:

Theorem 8.1 If our machines have $m = \Omega(\log N)$ storage, for some constant $N$, then with probability $1 - 1/N$, the bound on the rate of new requests per cache when we have $C$ machines of size $m$ is $\frac{2\lambda\log_d C}{C} + O\!\left(\frac{\lambda dq}{m}\right)$.

Observe the tradeoffs implicit in this theorem. Increasing $m$ causes the load to decrease proportionately, but never below $\Omega(\lambda\log C/C)$. Increasing $d$ increases the load linearly (but reduces the number of hops on a request path). Increasing $q$ seems only to hurt, suggesting that we should always take $q = 2$.
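To make the tradeoffs concrete, a small evaluation of the Theorem 8.1 bound, with every hidden constant set to 1 and all parameter values invented for illustration:

```python
import math

def rate_bound(lam, C, d, q, m):
    """Theorem 8.1 rate bound with hidden constants taken as 1:
    the first term shrinks as C grows, the second grows linearly in d
    and falls off with cache size m."""
    return 2 * lam * math.log(C, d) / C + lam * d * q / m

lam, C, q = 1000.0, 10_000, 2
for d in (4, 16, 64):
    for m in (100, 1000):
        print("d=%2d  m=%4d  rate <= %.3f" % (d, m, rate_bound(lam, C, d, q, m)))
```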
The above analysis used the rate at which requests were issued to measure the rate at which connections are established to machines. If we also assume that each connection lasts for a finite duration, this immediately translates into a bound on the number of connections open at a machine at any given time.
9 Conclusion

This paper has focused on one particular caching problem: that of handling read requests on the Web. We believe the ideas have broader applicability. In particular, consistent hashing may be a useful tool for distributing information from name servers such as DNS and label servers such as PICS in a load-balanced and fault-tolerant fashion. Our two schemes may together provide an interesting method for constructing multicast trees [4].

Another important way in which our ideas could be extended is in handling pages whose information changes over time, due to either server or client activity. If we augment our protocol to let the server know which machines are currently caching its page, then the server can notify those caches whenever the data on its pages changes. This might work particularly well in conjunction with the multicast protocols currently under development [4], which broadcast information from a server to all the client members of a multicast "group." Our protocol can be mapped into this model if we assume that every machine "caching" a page joins a multicast group for that page. Even without multicast, if each cache keeps track, for each page it caches, of the at most $d$ other caches to which it has given the page, then notification of changes can be sent down the tree to only the caches that hold copies.
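A sketch of that last bookkeeping idea, under assumptions of ours: `handed_to` maps each (cache, page) pair to the at most $d$ caches a copy was handed to, and `drop` is whatever callable evicts a local copy.

```python
def notify_change(cache, page, handed_to, drop):
    """Propagate a change notification down the caching tree: each cache
    invalidates its own copy, then forwards the notice to the (at most d)
    caches it handed the page to."""
    drop(cache, page)
    for child in handed_to.get((cache, page), ()):
        notify_change(child, page, handed_to, drop)
```

Starting this at the server's immediate children reaches exactly the caches holding copies, never the rest of the network.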
It remains open how to deal with time when modeling the Internet, because the communication protocols have no guarantees regarding time of delivery. Indeed, at the packet level, there are no guarantees even of eventual delivery. This suggests modeling the Internet as some kind of distributed system. Clearly, in a model in which there are no guarantees regarding delivery times, the best one can hope to prove is some of the classical liveness and safety properties underlying distributed algorithms. It is not clear what one can prove about caching and swamping in such a model. We think that there is significant research to be done on the proper way to model this aspect of the Internet.

We also believe that interesting open questions remain regarding the method of consistent hashing that we present in this paper. Among them are the following. Is there a $k$-universal consistent hash function that can be evaluated efficiently? What tradeoffs can be achieved between spread and load? Are there some kind of "perfect" consistent hash functions that can be constructed deterministically with the same spread and load bounds we give? On what other theoretical problems can consistent hashing give us a handle?

References

[1] Anawat Chankhunthod, Peter Danzig, Chuck Neerdaels, Michael Schwartz and Kurt Worrell. A Hierarchical Internet Object Cache. In USENIX Proceedings, 1996.
[2] Robert Devine. Design and Implementation of DDH: A Distributed Dynamic Hashing Algorithm. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, 1993.
[3] M. J. Feeley, W. E. Morgan, F. P. Pighin, A. R. Karlin, H. M. Levy and C. A. Thekkath. Implementing Global Memory Management in a Workstation Cluster. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.
[4] Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu and Lixia Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In SIGCOMM '95.
[5] Witold Litwin, Marie-Anne Neimat and Donovan A. Schneider. LH*: A Scalable, Distributed Data Structure. ACM Transactions on Database Systems, Dec. 1996.
[6] Radhika Malpani, Jacob Lorch and David Berger. Making World Wide Web Caching Servers Cooperate. In Proceedings of the World Wide Web Conference, 1996.
[7] M. Naor and A. Wool. The Load, Capacity, and Availability of Quorum Systems. In Proceedings of the 35th IEEE Symposium on Foundations of Computer Science, pages 214-225, November 1994.
[8] D. Peleg and A. Wool. The Availability of Quorum Systems. Information and Computation 123(2):210-233, 1995.
[9] Greg Plaxton and Rajmohan Rajaraman. Fast Fault-Tolerant Concurrent Access to Shared Objects. In Proceedings of the 37th IEEE Symposium on Foundations of Computer Science, 1996.
[10] M. O. Rabin. Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. Journal of the ACM 36:335-348, 1989.
[11] Jeanette Schmidt, Alan Siegel and Aravind Srinivasan. Chernoff-Hoeffding Bounds for Applications with Limited Independence. In Proc. 4th ACM-SIAM Symposium on Discrete Algorithms, 1993.