CS 382: Network-Centric Computing
Scalable Distributed Storage:
DHTs and Consistent Hashing
Zartash Afzal Uzmi
Spring 2023-24
ACK: Slides use some material from Scott Shenker (UC Berkeley) and Jim Kurose (UMass)
Agenda
⚫ Back to the Application Layer
⚫ P2P networks
⚫ Distributed Storage of Data
⚫ Next Lecture
⚫ Preview
2
Note: Change in our topic schedule
⚫ We will be discussing topics related to PA4 first, then move back to the
link layer and other topics
3
How are Internet apps architected?
Client-Server Architecture
[Figure: a user device connects over the Internet to remote services]
5
How are Internet apps architected?
Peer-to-Peer Architecture
[Figure: peer devices connect to each other over the Internet]
The course so far …
⚫ Studied fundamental networking concepts
7
The next few lectures of the course
⚫ Study fundamental concepts in building systems for distributed
applications
8
Next Few Classes: Fundamental Questions
⚫ How to distribute data for access by application components?
⚫ DHTs, consistent hashing
⚫ How to coordinate in a distributed application?
⚫ Time synchronization
⚫ How to design for fault tolerance?
⚫ Failures, replication, consistency
9
Today’s Lecture
⚫ How to partition data for a large distributed application?
10
Motivation
⚫ Present-day internet applications often store vast amounts of data and
need fast access to it
⚫ Examples:
⚫ E-commerce websites like Amazon store shopping carts for millions of users during
peak times
⚫ Social media and social networking websites store pictures, friends lists, reels,
conversation threads, for billions of users
⚫ A Content Distribution Network like Akamai caches more than 10% of the world's web content
11
Distributed Hash Tables
12
Hash Tables
⚫ A simple database that stores (key, value) pairs
⚫ Operations?
⚫ put(key, value)
⚫ get(key): returns the value
⚫ delete(key): deletes the (key, value) pair
⚫ Expected time for operations: O(1)
⚫ Example: (ID, Name), as in the table below
⚫ Another example:
⚫ Key: movie title
⚫ Value: list of IP addresses

Key           Value
132-54-3570   Tariq Usman
761-55-3791   Hina Akram
385-41-0902   Rida Ali
441-89-1956   Ali Ahmed
217-66-5609   Afshan Khan
…             …
177-23-0199   Naeem Ullah
13
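Below is a minimal Python sketch of the operations on this slide, using a plain dict as the underlying store; the class name HashTable and the sample entry are illustrative, not from the slides.

class HashTable:
    def __init__(self):
        self._store = {}                      # Python's dict is itself a hash table

    def put(self, key, value):                # insert or overwrite a (key, value) pair
        self._store[key] = value

    def get(self, key):                       # expected O(1) lookup; None if absent
        return self._store.get(key)

    def delete(self, key):                    # delete the (key, value) pair if present
        self._store.pop(key, None)

table = HashTable()
table.put("132-54-3570", "Tariq Usman")
print(table.get("132-54-3570"))               # -> Tariq Usman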
Why is it called a “HASH” table?
⚫ Often the key stored is a numerical representation
⚫ A number instead of a movie name. Why?
⚫ More convenient and efficient to store and search
⚫ Typically, the internal (stored) key is the hash of the original key

Original Key          Key (hash)   Value (list of IP addresses)
The Prestige          8962458      …
Godfather             7800356      …
Heat                  1567109      …
The English Patient   2360012      …
Jerry Maguire         5430938      …
Interstellar          9290124      …
…                     …            …
14
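A small sketch of how an original key (a movie name) can be turned into a numeric stored key, assuming SHA-1 via Python's hashlib; truncating to seven digits only mimics the numbers in the table above and is not something the slide prescribes.

import hashlib

def numeric_key(original_key):
    # SHA-1 gives the same digest on every machine (unlike Python's built-in hash())
    digest = hashlib.sha1(original_key.encode()).hexdigest()
    return int(digest, 16) % 10_000_000       # keep 7 digits, purely for illustration

print(numeric_key("The Prestige"))            # a fixed 7-digit number for this title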
Problem
⚫ Many applications (e.g., Facebook, Amazon apps) require access to
billions of (key, value) pairs
⚫ Applications need fast access to these (key, value) pairs
⚫ Example: Memcached (an in-memory key/value store)
⚫ If all data is stored on a single server, it can become a bottleneck
⚫ Too many entries in one place
⚫ Too much query traffic to handle
⚫ The data may be too large to fit on a single computer
⚫ Distributed Hash Tables (DHTs) extend a hash table to span multiple
machines
15
Distributed Hash Tables
⚫ A hash table implemented as a distributed data structure
⚫ (key, value) pairs stored across multiple machines
⚫ Where to search/query the DHT?
⚫ At any of the nodes!
⚫ It will “find” the answer
16
How to partition (key,value) pairs
across servers?
17
Desired Properties
⚫ Balanced load distribution
⚫ (key, value) pairs stored over millions of peers
⚫ No server has “too many” data items – even distribution is ideal!
⚫ The number of queries is also evenly distributed (roughly)
⚫ Smoothness
⚫ On addition/removal of servers, minimize the number of keys that need to be
relocated – robust to peers coming and going (churn)
⚫ Scalability – this is important!
⚫ Each peer only knows about a small number of other peers
⚫ To resolve a query, a small number of messages exchanged among peers
18
Example Problem
⚫ Suppose you are working at Netflix, and have to store movies on
different servers
⚫ Movies are the objects you need to store on servers
⚫ Movie names (or hash of names) are keys in the hash table
⚫ What are the values?
⚫ Entire movie data?
⚫ An IP address?
⚫ Or even a server ID (that can, perhaps, be translated to an IP address)
19
Solution#1
⚫ Use random assignment to map a movie to a server
⚫ Is this load balanced?
⚫ Yes in terms of pairs per peer
⚫ For key popularity? (also Yes)
⚫ Is this scalable?
⚫ Scalable in storage!
⚫ Querying peer has no idea where to look
⚫ May need to query all the peers in the DHT to “find” the answer
⚫ Is this Smooth?
⚫ Do lots of keys need relocation on the addition/removal of a node?
20
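A toy simulation of Solution #1, with made-up server names: random placement balances storage in expectation, but a lookup that has no record of the placement may have to ask every server.

import random

servers = ["server-0", "server-1", "server-2"]      # hypothetical server names
placement = {s: {} for s in servers}                # simulated contents of each server

def put(movie, value):
    # Random assignment: balanced in expectation, but nothing tells us later where it went
    placement[random.choice(servers)][movie] = value

def get(movie):
    # Without extra state, the querying peer may query every server: O(#servers) messages
    for s in servers:
        if movie in placement[s]:
            return s, placement[s][movie]
    return None

put("The Prestige", "list of IP addresses")
print(get("The Prestige"))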
Solution#2
⚫ Use a hash function, with a modulo operation, to map a movie to a server
⚫ Server ID = hash(movie-name) % num_servers
⚫ What are the key and the value in our (key, value) pair?
⚫ Hash(movie-name) is key
⚫ Server ID is the value
⚫ Total number of keys >> num_servers
⚫ Load balanced? Scalable? Smooth?
21
Example
# of Servers = 3 (Server IDs: 0, 1, 2)
# of Movies = 5
Server ID = hash(movie-name) % num_servers
Suppose the server with ID 2 crashes, leaving servers 0 and 1. How do the mappings change?

Movie Name             Hash(movie-name)   Server ID (hash % 3)   Server ID after failure (hash % 2)
Spirited Away          10                 1                      0
Inside Out             11                 2                      1
Klaus                  13                 1                      1
Howl’s Moving Castle   14                 2                      0
The Little Prince      15                 0                      1

(These mappings are reproduced in code right after this slide.)
22
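The few lines below reproduce the mappings in this example, using the toy hash values from the table rather than a real hash function; every movie except Klaus changes servers when the modulus drops from 3 to 2.

toy_hash = {"Spirited Away": 10, "Inside Out": 11, "Klaus": 13,
            "Howl's Moving Castle": 14, "The Little Prince": 15}

for movie, h in toy_hash.items():
    before = h % 3                            # 3 servers: IDs 0, 1, 2
    after = h % 2                             # server 2 crashed: IDs 0, 1
    print(movie, before, "->", after, "(moved)" if before != after else "(stayed)")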
Issue
⚫ The solution doesn’t ensure the smoothness property
⚫ Example: N = 10 servers, K = 1000 pairs, use server ID = key % N
⚫ Add one server ➜ about 90% of the keys must move
⚫ A key stays put only if key % 10 == key % 11, i.e. for roughly a 1/(N+1) fraction of keys, so about N/(N+1) of the keys relocate (see the simulation after this slide)
⚫ Is the solution load balanced?
⚫ For (key, value) storage: Yes, for a good hash function
⚫ For the number of queries per node: Yes, for even key popularity
⚫ Is the solution scalable?
⚫ A small number of internal messages to resolve a query?
⚫ Yes – knowing the key, the server ID is readily known
23
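The simulation referred to above: sequential key IDs stand in for well-spread hashed keys, and counting shows that going from 10 to 11 servers under modulo placement relocates about 90% of the keys, whereas ideally only about K/(N+1) ≈ 91 keys would move.

K, N = 1000, 10
moved = sum(1 for key in range(K) if key % N != key % (N + 1))
print(moved, "of", K, "keys move")            # 900 of 1000, i.e. about 90%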
What do we need?
⚫ A solution that doesn’t depend on the number of servers
⚫ When adding or removing servers, the number of keys that need to be relocated is
minimized
24
Solution#3: Consistent Hashing
⚫ Provides nice smoothness and data balancing properties
⚫ Widely used in industry:
⚫ Amazon’s Dynamo data store
⚫ Used for Amazon’s e-commerce website
⚫ Facebook’s Memcached system
⚫ Akamai’s Content Delivery Network
⚫ Google’s storage systems
⚫ Microsoft Azure’s Storage System
⚫ Apache Cassandra storage system
⚫ For in-network load balancing
⚫ Google’s Maglev load balancer
⚫ …
25
Consistent Hashing: Construction
⚫ Use an n-bit identifier for:
⚫ Keys: use hash(for example, movie name)
⚫ Servers: use hash(for example, server IP)
⚫ Use a standard hash function such as SHA-1
⚫ Servers and Keys both get mapped to a number
⚫ A number from 0 to 2^n - 1
⚫ Where and how to store the (k,v) pairs?
⚫ Store (k,v) at a server that is the immediate successor of k
⚫ The closest clockwise server greater than or equal to k
⚫ Servers and keys mapped to an “abstract” circle or hash ring
26
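A compact sketch of this construction, assuming SHA-1 and the 6-bit identifier space used on the next slide (hash collisions in such a tiny space are simply ignored here); bisect finds the immediate clockwise successor.

import bisect
import hashlib

M = 6                                         # identifier bits: IDs lie in [0, 2**M)

def ident(text):
    # Map a server IP or a key (e.g. a movie name) into the M-bit identifier space
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

class HashRing:
    def __init__(self, server_ips):
        # Each server sits on the ring at the hash of its IP address
        self.ids = sorted(ident(ip) for ip in server_ips)
        self.ip_of = {ident(ip): ip for ip in server_ips}

    def successor(self, key_id):
        # Closest clockwise server with ID >= key_id, wrapping around past 2**M - 1
        i = bisect.bisect_left(self.ids, key_id) % len(self.ids)
        return self.ip_of[self.ids[i]]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])     # placeholder IPs
print(ring.successor(ident("The Prestige")))              # the server storing this key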
Example
⚫ Let’s use a 6-bit hash function
⚫ Assume 8 servers are available
⚫ Take a 6-bit hash of IP addresses
⚫ Get: 1, 12, 13, 25, 32, 40, 48, 60
⚫ Create the hash ring
⚫ For each key, use a 6-bit hash to get the key identifier k (say 35)
⚫ Store (k,v) at the immediate successor of k – here, (35, v) goes to server 40
⚫ Resolving a query: how many servers should a server be connected to?
[Figure: hash ring with servers at 1, 12, 13, 25, 32, 40, 48, 60; the pair (35, v) is stored at server 40]
27
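Plugging this slide's numbers in directly (the server IDs 1 through 60 are already given, so the hashing step is skipped): key 35 lands on its immediate successor, server 40, and identifiers past 60 wrap around to server 1.

import bisect

server_ids = [1, 12, 13, 25, 32, 40, 48, 60]  # already sorted around the ring

def successor(key_id):
    i = bisect.bisect_left(server_ids, key_id) % len(server_ids)
    return server_ids[i]

print(successor(35))                          # -> 40
print(successor(62))                          # -> 1 (wraps around past 60)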
Circular DHT in the internet
[Figure: the same hash ring (peers 1, 12, 13, 25, 32, 40, 48, 60) realized as an "overlay network" on top of the Internet]
28
Consistent Hashing: Summary
⚫ Partitions the key-space among servers
⚫ Servers choose random identifiers
⚫ For example, hash (IP)
⚫ Keys randomly distributed in ID‐space:
⚫ hash(movie-name)
⚫ Spreads ownership of keys evenly across servers
29
Questions?
30
Consistent Hashing: Load Balancing
⚫ Each server owns 1/Nth of the ID space in expectation
⚫ Where N is the number of servers
⚫ What happens if a server fails?
⚫ If a server fails, its successor takes over the space
⚫ Smoothness goal: only the failed server’s keys get relocated
⚫ But now the successor owns 2/N of the key space
⚫ Failures can upset the load balance (sketched in code after this slide)
⚫ What if servers have different capacities?
⚫ The basic algorithm is oblivious to node heterogeneity
31
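The sketch mentioned above, reusing the ring from the earlier example: when server 40 fails, every key on the arc between 32 and 40 falls onto its successor 48, which then owns twice as much of the ring as before.

import bisect

def successor(ids, key_id):
    i = bisect.bisect_left(ids, key_id) % len(ids)
    return ids[i]

before = [1, 12, 13, 25, 32, 40, 48, 60]
after = [s for s in before if s != 40]        # server 40 fails

for key_id in (35, 38, 45):
    print(key_id, successor(before, key_id), "->", successor(after, key_id))
# Keys 35 and 38 (previously on server 40) move to 48; key 45 stays on 48.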
How to better load balance?
⚫ Identify the core reason for the load-balancing issue
⚫ When a server fails, all its storage falls onto the successor
⚫ Can we increase server storage granularity along the ring?
⚫ Try representing each server as multiple (V) virtual servers
⚫ Spread the virtual servers along the ring
⚫ Failure of a physical node
⚫ All virtual instances (of this server) will fail
⚫ The (key,value) pairs to be relocated will remain the same…but
⚫ Will fall onto V successors (instead of just one)
⚫ Better load balancing!!!
32
Virtual Nodes in DHT
How to implement?
Normally, we use hash(IP) to get the server ID
For V virtual servers, use:
Hash(IP+”1”)
Hash(IP+”2”)
…
Hash(IP+”V”)
Suppose Server 1 physically fails now (so both Server 1-1 and Server 1-2 fail)
The storage of Server 1-1 and Server 1-2 gets relocated to the two immediate successors
[Figure: hash ring with V = 2 virtual servers per physical server (Server1-1, Server1-2, Server2-1, Server2-2, Server3-1, Server3-2) interleaved with the keys movie1 through movie6]
33
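A sketch of the implementation trick on this slide: derive V ring positions per physical server by hashing its IP concatenated with an index; the IP addresses below are placeholders and the 6-bit ring matches the earlier example.

import hashlib

M, V = 6, 2                                   # 6-bit ring, V = 2 virtual servers each

def virtual_id(ip, i):
    # Hash the IP together with the virtual-server index, e.g. Hash(IP + "1")
    return int(hashlib.sha1((ip + str(i)).encode()).hexdigest(), 16) % (2 ** M)

ring = []
for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):           # placeholder IPs for Servers 1-3
    for i in range(1, V + 1):
        ring.append((virtual_id(ip, i), ip))
ring.sort()                                   # each physical server now appears V times
print(ring)
# If one physical server fails, its V virtual positions vanish and the keys they
# held fall onto V different successors instead of a single one.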
Virtual Nodes: Summary
⚫ Idea: Each physical node now maintains V > 1 virtual nodes
⚫ Each virtual node owns an expected 1/(VN)th of ID space
⚫ Upon a physical node’s failure, V successors take over
⚫ Result: Better load balance with larger V
⚫ The number of virtual nodes that a node is responsible for can be decided
based on its capacity
34
Theoretical Results
⚫ For any set of N nodes and K keys, with high probability:
⚫ Each node is responsible for at most (1 + ε) · K/N keys
⚫ ε can be reduced to an arbitrarily small constant by having each node run O(log N) virtual nodes
⚫ When an (N+1)-st node joins or leaves the network, responsibility for only O(K/N) keys changes hands (and only to and from the joining or leaving node)
35
Proven by D. Lewin in his Master’s thesis “Consistent hashing and random trees”, MIT, 1998
Summary
⚫ How do we partition data in distributed systems?
⚫ With a goal to achieve balance and smoothness
⚫ Consistent hashing is widely used
⚫ Provides smoothness, scalability, and load balancing
⚫ Load balancing can be impacted under node removal or addition
⚫ Virtual nodes
⚫ Can help with load imbalance caused by failures/additions
⚫ Also handles different server capacities
36
Next Lecture
⚫ How to efficiently locate where a data item is stored in a distributed
system?
37
Resolving a Query
What is the value associated with key 53?
[Figure: a query for key 53 is forwarded around the ring of peers 1, 12, 13, 25, 32, 40, 48, 60 until it reaches the responsible peer, which returns the value]
O(N) messages on average to resolve a query, when there are N peers (a code sketch follows below)
38
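The code sketch referred to above, assuming each peer knows only its clockwise successor: the query for key 53 is forwarded peer by peer until it reaches the responsible peer (60), which here takes 7 hops when starting from peer 1.

peers = [1, 12, 13, 25, 32, 40, 48, 60]       # ring order; each peer knows only its successor

def owns(i, key_id):
    # Peer i owns key_id if key_id lies in the arc (predecessor, peer], with wrap-around
    pred, me = peers[i - 1], peers[i]
    if pred < me:
        return pred < key_id <= me
    return key_id > pred or key_id <= me      # the arc that wraps past identifier 0

def lookup(start, key_id):
    i, hops = start, 0
    while not owns(i, key_id):
        i = (i + 1) % len(peers)              # forward the query to the next peer
        hops += 1
    return peers[i], hops

print(lookup(0, 53))                          # -> (60, 7)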
Questions?
39