CS 382: Network-Centric Computing
Scalable Distributed Storage:
DHTs and Consistent Hashing
Zartash Afzal Uzmi
Spring 2023-24
ACK: Slides use some material from Scott Shenker (UC Berkeley) and Jim Kurose (UMass)
Agenda
⚫ Back to the Application Layer
⚫ P2P networks
⚫ Distributed Storage of Data
⚫ Next Lecture
⚫ Preview
2
Note: Change in our topic schedule
⚫ We will be discussing topics related to PA4 first, then move back to the
link layer and other topics
3
How are Internet apps architected?
Client-Server Architecture
[Figure: a user device connects over the Internet to remote services]
5
How are Internet apps architected?
Peer-to-Peer Architecture
[Figure: peer devices connect to each other over the Internet]
The course so far …
⚫ Studied fundamental networking concepts
7
The next few lectures of the course
⚫ Study fundamental concepts in building systems for distributed
applications
8
Next Few Classes: Fundamental Questions
⚫ How to distribute data for access by application components?
⚫ DHTs, consistent hashing
⚫ How to coordinate in a distributed application?
⚫ Time synchronization
⚫ How to design for fault tolerance?
⚫ Failures, replication, consistency
9
Today’s Lecture
⚫ How to partition data for a large distributed application?
10
Motivation
⚫ Present-day internet applications often store vast amounts of data and
need fast access to it
⚫ Examples:
⚫ E-commerce websites like Amazon store shopping carts for millions of users during
peak times
⚫ Social media and social networking websites store pictures, friends lists, reels,
conversation threads, for billions of users
⚫ A Content Distribution Network like Akamai caches more than 10% of the world's web content
11
Distributed Hash Tables
12
Hash Tables
⚫ A simple database that stores (key, value) pairs
⚫ Operations?
⚫ put(key, value)
⚫ get(key): returns the value
⚫ delete(key): deletes the (key, value) pair
⚫ Expected time for operations: O(1)
⚫ Example: (ID, Name), as in the table below
⚫ Another example:
⚫ Key: movie title
⚫ Value: list of IP addresses

Key           Value
132-54-3570   Tariq Usman
761-55-3791   Hina Akram
385-41-0902   Rida Ali
441-89-1956   Ali Ahmed
217-66-5609   Afshan Khan
…             …
177-23-0199   Naeem Ullah
13
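Below is a minimal Python sketch of the operations on this slide, using a plain dict as the underlying store; the class name HashTable and the sample entry are illustrative, not from the slides.

class HashTable:
    def __init__(self):
        self._store = {}                      # Python's dict is itself a hash table

    def put(self, key, value):                # insert or overwrite a (key, value) pair
        self._store[key] = value

    def get(self, key):                       # expected O(1) lookup; None if absent
        return self._store.get(key)

    def delete(self, key):                    # delete the (key, value) pair if present
        self._store.pop(key, None)

table = HashTable()
table.put("132-54-3570", "Tariq Usman")
print(table.get("132-54-3570"))               # -> Tariq Usman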
Why is it called a “HASH” table?
⚫ Often the key stored is a numerical representation
⚫ A number instead of a movie name. Why?
⚫ More convenient and efficient to store and search
⚫ Typically, the internal (stored) key is the hash of the original key

Original Key          Key (hash)   Value (list of IP addresses)
The Prestige          8962458      …
Godfather             7800356      …
Heat                  1567109      …
The English Patient   2360012      …
Jerry Maguire         5430938      …
Interstellar          9290124      …
…                     …            …
14
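A small sketch of how an original key (a movie name) can be turned into a numeric stored key, assuming SHA-1 via Python's hashlib; truncating to seven digits only mimics the numbers in the table above and is not something the slide prescribes.

import hashlib

def numeric_key(original_key):
    # SHA-1 gives the same digest on every machine (unlike Python's built-in hash())
    digest = hashlib.sha1(original_key.encode()).hexdigest()
    return int(digest, 16) % 10_000_000       # keep 7 digits, purely for illustration

print(numeric_key("The Prestige"))            # a fixed 7-digit number for this title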
Problem
⚫ Many applications (e.g., Facebook, Amazon apps) require access to
billions of (key, value) pairs
⚫ Applications need fast access to these (key, value) pairs
⚫ Example: Memcached (an in-memory key/value store)
⚫ If all data is stored on a single server, it can become a bottleneck
⚫ Too many entries in one place
⚫ Too much query traffic to handle
⚫ The data may be too large to fit on a single computer
⚫ Distributed Hash Tables (DHTs) extend a hash table to span multiple
machines
15
Distributed Hash Tables
⚫ A hash table implemented as a distributed data structure
⚫ (key, value) pairs stored across multiple machines
⚫ Where to search/query the DHT?
⚫ At any of the nodes!
⚫ It will “find” the answer
16
How to partition (key,value) pairs
across servers?
17
Desired Properties
⚫ Balanced load distribution
⚫ (key, value) pairs stored over millions of peers
⚫ No server has “too many” data items – even distribution is ideal!
⚫ The number of queries is also evenly distributed (roughly)
⚫ Smoothness
⚫ On addition/removal of servers, minimize the number of keys that need to be
relocated – robust to peers coming and going (churn)
⚫ Scalability – this is important!
⚫ Each peer only knows about a small number of other peers
⚫ To resolve a query, a small number of messages exchanged among peers
18
Example Problem
⚫ Suppose you are working at Netflix, and have to store movies on
different servers
⚫ Movies are the objects you need to store on servers
⚫ Movie names (or hash of names) are keys in the hash table
⚫ What are the values?
⚫ Entire movie data?
⚫ An IP address?
⚫ Or even a server ID (that can, perhaps, be translated to an IP address)
19
Solution#1
⚫ Use random assignment to map a movie to a server
⚫ Is this load balanced?
⚫ Yes in terms of pairs per peer
⚫ For key popularity? (also Yes)
⚫ Is this scalable?
⚫ Scalable in storage!
⚫ Querying peer has no idea where to look
⚫ May need to query all the peers in the DHT to “find” the answer
⚫ Is this Smooth?
⚫ Do lots of keys need relocation on the addition/removal of a node?
20
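A toy simulation of Solution #1, with made-up server names: random placement balances storage in expectation, but a lookup that has no record of the placement may have to ask every server.

import random

servers = ["server-0", "server-1", "server-2"]      # hypothetical server names
placement = {s: {} for s in servers}                # simulated contents of each server

def put(movie, value):
    # Random assignment: balanced in expectation, but nothing tells us later where it went
    placement[random.choice(servers)][movie] = value

def get(movie):
    # Without extra state, the querying peer may query every server: O(#servers) messages
    for s in servers:
        if movie in placement[s]:
            return s, placement[s][movie]
    return None

put("The Prestige", "list of IP addresses")
print(get("The Prestige"))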
Solution#2
⚫ Use a hash function, with a modulo operation, to map a movie to a server
⚫ Server ID = hash(movie-name) % num_servers
⚫ What are the key and the value in our (key, value) pair?
⚫ Hash(movie-name) is key
⚫ Server ID is the value
⚫ Total number of keys >> num_servers
⚫ Load balanced? Scalable? Smooth?
21
Example
# of Servers = 3 (Server IDs: 0, 1, 2)
# of Movies = 5
Server ID = hash(movie-name) % num_servers
Suppose the server with ID 2 crashes, leaving servers 0 and 1. How do the mappings change?

Movie Name             Hash(movie-name)   Server ID (hash % 3)   Server ID after failure (hash % 2)
Spirited Away          10                 1                      0
Inside Out             11                 2                      1
Klaus                  13                 1                      1
Howl’s Moving Castle   14                 2                      0
The Little Prince      15                 0                      1

(These mappings are reproduced in code right after this slide.)
22
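The few lines below reproduce the mappings in this example, using the toy hash values from the table rather than a real hash function; every movie except Klaus changes servers when the modulus drops from 3 to 2.

toy_hash = {"Spirited Away": 10, "Inside Out": 11, "Klaus": 13,
            "Howl's Moving Castle": 14, "The Little Prince": 15}

for movie, h in toy_hash.items():
    before = h % 3                            # 3 servers: IDs 0, 1, 2
    after = h % 2                             # server 2 crashed: IDs 0, 1
    print(movie, before, "->", after, "(moved)" if before != after else "(stayed)")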
Issue
⚫ The solution doesn’t ensure the smoothness property
⚫ Example: N = 10 servers, K = 1000 pairs, use server ID = key % N
⚫ Add one server ➜ about 90% of the keys must move
⚫ A key stays put only if key % 10 == key % 11, i.e. for roughly a 1/(N+1) fraction of keys, so about N/(N+1) of the keys relocate (see the simulation after this slide)
⚫ Is the solution load balanced?
⚫ For (key, value) storage: Yes, for a good hash function
⚫ For the number of queries per node: Yes, for even key popularity
⚫ Is the solution scalable?
⚫ A small number of internal messages to resolve a query?
⚫ Yes – knowing the key, the server ID is readily known
23
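The simulation referred to above: sequential key IDs stand in for well-spread hashed keys, and counting shows that going from 10 to 11 servers under modulo placement relocates about 90% of the keys, whereas ideally only about K/(N+1) ≈ 91 keys would move.

K, N = 1000, 10
moved = sum(1 for key in range(K) if key % N != key % (N + 1))
print(moved, "of", K, "keys move")            # 900 of 1000, i.e. about 90%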
What do we need?
⚫ A solution that doesn’t depend on the number of servers
⚫ When adding or removing servers, the number of keys that need to be relocated is
minimized
24
Solution#3: Consistent Hashing
⚫ Provides nice smoothness and data balancing properties
⚫ Widely used in industry:
⚫ Amazon’s Dynamo data store
⚫ Used for Amazon’s e-commerce website
⚫ Facebook’s Memcached system
⚫ Akamai’s Content Delivery Network
⚫ Google’s storage systems
⚫ Microsoft Azure’s Storage System
⚫ Apache Cassandra storage system
⚫ For in-network load balancing
⚫ Google’s Maglev load balancer
⚫ …
25
Consistent Hashing: Construction
⚫ Use an n-bit identifier for:
⚫ Keys: use hash(for example, movie name)
⚫ Servers: use hash(for example, server IP)
⚫ Use a standard hash function such as SHA-1
⚫ Servers and Keys both get mapped to a number
⚫ A number from 0 to 2^n - 1
⚫ Where and how to store the (k,v) pairs?
⚫ Store (k,v) at a server that is the immediate successor of k
⚫ The closest clockwise server greater than or equal to k
⚫ Servers and keys mapped to an “abstract” circle or hash ring
26
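A compact sketch of this construction, assuming SHA-1 and the 6-bit identifier space used on the next slide (hash collisions in such a tiny space are simply ignored here); bisect finds the immediate clockwise successor.

import bisect
import hashlib

M = 6                                         # identifier bits: IDs lie in [0, 2**M)

def ident(text):
    # Map a server IP or a key (e.g. a movie name) into the M-bit identifier space
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

class HashRing:
    def __init__(self, server_ips):
        # Each server sits on the ring at the hash of its IP address
        self.ids = sorted(ident(ip) for ip in server_ips)
        self.ip_of = {ident(ip): ip for ip in server_ips}

    def successor(self, key_id):
        # Closest clockwise server with ID >= key_id, wrapping around past 2**M - 1
        i = bisect.bisect_left(self.ids, key_id) % len(self.ids)
        return self.ip_of[self.ids[i]]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])     # placeholder IPs
print(ring.successor(ident("The Prestige")))              # the server storing this key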
Example
⚫ Let’s use a 6-bit hash function
⚫ Assume 8 servers are available
⚫ Take a 6-bit hash of IP addresses
⚫ Get: 1, 12, 13, 25, 32, 40, 48, 60
⚫ Create the hash ring
⚫ For each key, use a 6-bit hash to get the key identifier k (say 35)
⚫ Store (k,v) at the immediate successor of k – here, (35, v) goes to server 40
⚫ Resolving a query: how many servers should a server be connected to?
[Figure: hash ring with servers at 1, 12, 13, 25, 32, 40, 48, 60; the pair (35, v) is stored at server 40]
27
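Plugging this slide's numbers in directly (the server IDs 1 through 60 are already given, so the hashing step is skipped): key 35 lands on its immediate successor, server 40, and identifiers past 60 wrap around to server 1.

import bisect

server_ids = [1, 12, 13, 25, 32, 40, 48, 60]  # already sorted around the ring

def successor(key_id):
    i = bisect.bisect_left(server_ids, key_id) % len(server_ids)
    return server_ids[i]

print(successor(35))                          # -> 40
print(successor(62))                          # -> 1 (wraps around past 60)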
Circular DHT in the internet
[Figure: the same hash ring (peers 1, 12, 13, 25, 32, 40, 48, 60) realized as an "overlay network" on top of the Internet]
28
Consistent Hashing: Summary
⚫ Partitions the key-space among servers
⚫ Servers choose random identifiers
⚫ For example, hash (IP)
⚫ Keys randomly distributed in ID‐space:
⚫ hash(movie-name)
⚫ Spreads ownership of keys evenly across servers
29
Questions?
30
Consistent Hashing: Load Balancing
⚫ Each server owns 1/Nth of the ID space in expectation
⚫ Where N is the number of servers
⚫ What happens if a server fails?
⚫ If a server fails, its successor takes over the space
⚫ Smoothness goal: only the failed server’s keys get relocated
⚫ But now the successor owns 2/N of the key space
⚫ Failures can upset the load balance (sketched in code after this slide)
⚫ What if servers have different capacities?
⚫ The basic algorithm is oblivious to node heterogeneity
31
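The sketch mentioned above, reusing the ring from the earlier example: when server 40 fails, every key on the arc between 32 and 40 falls onto its successor 48, which then owns twice as much of the ring as before.

import bisect

def successor(ids, key_id):
    i = bisect.bisect_left(ids, key_id) % len(ids)
    return ids[i]

before = [1, 12, 13, 25, 32, 40, 48, 60]
after = [s for s in before if s != 40]        # server 40 fails

for key_id in (35, 38, 45):
    print(key_id, successor(before, key_id), "->", successor(after, key_id))
# Keys 35 and 38 (previously on server 40) move to 48; key 45 stays on 48.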
How to better load balance?
⚫ Identify the core reason for the load-balancing issue
⚫ When a server fails, all its storage falls onto the successor
⚫ Can we increase server storage granularity along the ring?
⚫ Try representing each server as multiple (V) virtual servers
⚫ Spread the virtual servers along the ring
⚫ Failure of a physical node
⚫ All virtual instances (of this server) will fail
⚫ The (key,value) pairs to be relocated will remain the same…but
⚫ Will fall onto V successors (instead of just one)
⚫ Better load balancing!!!
32
Virtual Nodes in DHT
How to implement?
Normally, we use hash(IP) to get the server ID
For V virtual servers, use:
Hash(IP+”1”)
Hash(IP+”2”)
…
Hash(IP+”V”)
Suppose Server 1 physically fails now (so both Server 1-1 and Server 1-2 fail)
The storage of Server 1-1 and Server 1-2 gets relocated to the two immediate successors
[Figure: hash ring with V = 2 virtual servers per physical server (Server1-1, Server1-2, Server2-1, Server2-2, Server3-1, Server3-2) interleaved with the keys movie1 through movie6]
33
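A sketch of the implementation trick on this slide: derive V ring positions per physical server by hashing its IP concatenated with an index; the IP addresses below are placeholders and the 6-bit ring matches the earlier example.

import hashlib

M, V = 6, 2                                   # 6-bit ring, V = 2 virtual servers each

def virtual_id(ip, i):
    # Hash the IP together with the virtual-server index, e.g. Hash(IP + "1")
    return int(hashlib.sha1((ip + str(i)).encode()).hexdigest(), 16) % (2 ** M)

ring = []
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):           # placeholder IPs for Servers 1-3
    for i in range(1, V + 1):
        ring.append((virtual_id(ip, i), ip))
ring.sort()                                   # each physical server now appears V times
print(ring)
# If one physical server fails, its V virtual positions vanish and the keys they
# held fall onto V different successors instead of a single one.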
Virtual Nodes: Summary
⚫ Idea: Each physical node now maintains V > 1 virtual nodes
⚫ Each virtual node owns an expected 1/(VN)th of ID space
⚫ Upon a physical node’s failure, V successors take over
⚫ Result: Better load balance with larger V
⚫ The number of virtual nodes that a node is responsible for can be decided
based on its capacity
34
Theoretical Results
⚫ For any set of N nodes and K keys, with high probability:
⚫ Each node is responsible for at most (1 + ε) · K/N keys
⚫ ε can be reduced to an arbitrarily small constant by having each node run O(log N) virtual nodes
⚫ When an (N+1)-st node joins or leaves the network, responsibility for only O(K/N) keys changes hands (and only to and from the joining or leaving node)
35
Proven by D. Lewin in his Master’s thesis “Consistent hashing and random trees”, MIT, 1998
Summary
⚫ How do we partition data in distributed systems?
⚫ With a goal to achieve balance and smoothness
⚫ Consistent hashing is widely used
⚫ Provides smoothness, scalability, and load balancing
⚫ Load balancing can be impacted under node removal or addition
⚫ Virtual nodes
⚫ Can help with load imbalance caused by failures/additions
⚫ Also handles different server capacities
36
Next Lecture
⚫ How to efficiently locate where a data item is stored in a distributed
system?
37
Resolving a Query
What is the value associated with key 53?
[Figure: a query for key 53 is forwarded around the ring of peers 1, 12, 13, 25, 32, 40, 48, 60 until it reaches the responsible peer, which returns the value]
O(N) messages on average to resolve a query, when there are N peers (a code sketch follows below)
38
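The code sketch referred to above, assuming each peer knows only its clockwise successor: the query for key 53 is forwarded peer by peer until it reaches the responsible peer (60), which here takes 7 hops when starting from peer 1.

peers = [1, 12, 13, 25, 32, 40, 48, 60]       # ring order; each peer knows only its successor

def owns(i, key_id):
    # Peer i owns key_id if key_id lies in the arc (predecessor, peer], with wrap-around
    pred, me = peers[i - 1], peers[i]
    if pred < me:
        return pred < key_id <= me
    return key_id > pred or key_id <= me      # the arc that wraps past identifier 0

def lookup(start, key_id):
    i, hops = start, 0
    while not owns(i, key_id):
        i = (i + 1) % len(peers)              # forward the query to the next peer
        hops += 1
    return peers[i], hops

print(lookup(0, 53))                          # -> (60, 7)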
Questions?
39