Hashing:
Hash Functions
Michael Levin
Department of Computer Science and Engineering
University of California, San Diego
Data Structures Fundamentals
Algorithms and Data Structures
Outline
1 Phone Book Data Structure
2 Universal Family
3 Hashing Phone Numbers
4 Hashing Names
5 Analysis of Polynomial Hashing
Phone Book
Design a data structure to store your
contacts: names of people along with their
phone numbers. The following operations
should be fast:
Add and delete contacts,
Call person by name,
Determine who is calling given their
phone number.
We need two Maps:
(phone number → name) and
(name → phone number)
We need two Maps:
(phone number → name) and
(name → phone number)
Implement these Maps as hash tables
We need two Maps:
(phone number → name) and
(name → phone number)
Implement these Maps as hash tables
First, we will focus on the Map from
phone numbers to names
Chaining for Phone Book
Select hash function h of cardinality m
Create array Chains of size m
Each element of Chains is a list of
pairs (name, phoneNumber), called
chain
Pair (name, phoneNumber) goes into
chain at position
h(ConvertToInt(phoneNumber)) in
the array Chains
Chaining for Phone Book
0
1 Sasha14052391717
2
3
4 Maria01707773331 Helen15025757575
5
6
7
Parameters
n contacts stored
Parameters
n contacts stored
m — cardinality of the hash function
Parameters
n contacts stored
m — cardinality of the hash function
Note: hash table size for chaining
should be equal to m
Parameters
n contacts stored
m — cardinality of the hash function
Note: hash table size for chaining
should be equal to m
c — length of the longest chain
Parameters
n contacts stored
m — cardinality of the hash function
Note: hash table size for chaining
should be equal to m
c — length of the longest chain
Θ(n + m) memory is used
Parameters
n contacts stored
m — cardinality of the hash function
Note: hash table size for chaining
should be equal to m
c — length of the longest chain
Θ(n + m) memory is used
Operations run in time Θ(c + 1)
Parameters
n contacts stored
m — cardinality of the hash function
Note: hash table size for chaining
should be equal to m
c — length of the longest chain
Θ(n + m) memory is used
Operations run in time Θ(c + 1)
You want small m and c! (but c ≥ mn )
Good Example
0
1 O3 O7
2 O1 c=2
m=8 3 O8 O4
4
5 O2
6 O6
7 O5
Bad Example
0
1 O1 O2 O3 O4 O5 O6 O7 O8
2 c=8
3
m=8
4
5
6
7
First Digits
For the map from phone numbers to
names, select m = 1000
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take first three digits
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take first three digits
h(800-123-45-67) = 800
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take first three digits
h(800-123-45-67) = 800
Problem: area code
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take first three digits
h(800-123-45-67) = 800
Problem: area code
h(425-234-55-67) =
h(425-123-45-67) =
h(425-223-23-23) = · · · = 425
Last Digits
Select m = 1000
Last Digits
Select m = 1000
Hash function: take last three digits
Last Digits
Select m = 1000
Hash function: take last three digits
h(800-123-45-67) = 567
Last Digits
Select m = 1000
Hash function: take last three digits
h(800-123-45-67) = 567
Problem if many phone numbers end
with three zeros
Random Value
Select m = 1000
Random Value
Select m = 1000
Hash function: random number
between 0 and 999
Random Value
Select m = 1000
Hash function: random number
between 0 and 999
Uniform distribution of hash values
Random Value
Select m = 1000
Hash function: random number
between 0 and 999
Uniform distribution of hash values
Different value when hash function
called again — we won’t be able to find
anything!
Random Value
Select m = 1000
Hash function: random number
between 0 and 999
Uniform distribution of hash values
Different value when hash function
called again — we won’t be able to find
anything!
Hash function must be deterministic
Good Hash Functions
Deterministic
Fast to compute
Distributes keys well into different cells
Few collisions
No Universal Hash Function
Lemma
If the number of possible keys is big
(|S| ≫ m), for any hash function h there is a
bad input resulting in many collisions.
S
m=3 S
h(k) = 2
h(k) = 0 h(k) = 1
m=3 S
h(k) = 2
h(k) = 0 h(k) = 1
m=3 S
h(k) = 2
h(k) = 0 h(k) = 1
42%
Outline
1 Phone Book Data Structure
2 Universal Family
3 Hashing Phone Numbers
4 Hashing Names
5 Analysis of Polynomial Hashing
Idea
Remember QuickSort?
Idea
Remember QuickSort?
Choosing random pivot helped
Idea
Remember QuickSort?
Choosing random pivot helped
Use randomization!
Idea
Remember QuickSort?
Choosing random pivot helped
Use randomization!
Define a family (set) of hash functions
Idea
Remember QuickSort?
Choosing random pivot helped
Use randomization!
Define a family (set) of hash functions
Choose random function from the family
Universal Family
Definition
Let U be the universe — the set of all
possible keys.
Universal Family
Definition
Let U be the universe — the set of all
possible keys. A set of hash functions
H = {h : U → {0, 1, 2, . . . , m − 1}}
Universal Family
Definition
Let U be the universe — the set of all
possible keys. A set of hash functions
H = {h : U → {0, 1, 2, . . . , m − 1}}
is called a universal family if
Universal Family
Definition
Let U be the universe — the set of all
possible keys. A set of hash functions
H = {h : U → {0, 1, 2, . . . , m − 1}}
is called a universal family if for any two keys
x, y ∈ U, x ̸= y the probability of collision
1
Pr[h(x) = h(y)] ≤
m
Universal Family
1
Pr[h(x) = h(y)] ≤
m
means that a collision h(x) = h(y) for any
fixed pair of different keys x and y happens
for no more than m1 of all hash functions
h ∈ H.
How Randomization Works
h(x) = random({0, 1, 2, . . . , m − 1})
gives probability of collision exactly m1 .
How Randomization Works
h(x) = random({0, 1, 2, . . . , m − 1})
gives probability of collision exactly m1 .
It is not deterministic — can’t use it.
How Randomization Works
h(x) = random({0, 1, 2, . . . , m − 1})
gives probability of collision exactly m1 .
It is not deterministic — can’t use it.
All hash functions in H are deterministic
How Randomization Works
h(x) = random({0, 1, 2, . . . , m − 1})
gives probability of collision exactly m1 .
It is not deterministic — can’t use it.
All hash functions in H are deterministic
Select a random function h from H
How Randomization Works
h(x) = random({0, 1, 2, . . . , m − 1})
gives probability of collision exactly m1 .
It is not deterministic — can’t use it.
All hash functions in H are deterministic
Select a random function h from H
Fixed h is used throughout the
algorithm
Load Factor
Definition
The ratio α = mn between number of objects
n stored in the hash table and the size of the
hash table m is called load factor.
Running Time
Lemma
If h is chosen randomly from a universal
family, the average length of the longest
chain c is O(1 + α), where α = mn is the load
factor of the hash table.
Corollary
If h is from universal family, operations with
hash table run on average in time O(1 + α).
Choosing Hash Table Size
Control amount of memory used with m
Choosing Hash Table Size
Control amount of memory used with m
Ideally, load factor 0.5 < α < 1
Choosing Hash Table Size
Control amount of memory used with m
Ideally, load factor 0.5 < α < 1
Use O(m) = O( αn ) = O(n) memory to
store n keys
Choosing Hash Table Size
Control amount of memory used with m
Ideally, load factor 0.5 < α < 1
Use O(m) = O( αn ) = O(n) memory to
store n keys
Operations run in time
O(1 + α) = O(1) on average
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Resize the hash table when α becomes
too large
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Resize the hash table when α becomes
too large
Choose new hash function and rehash
all the objects
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehashing
Keep load factor below 0.9:
Rehash(T)
loadFactor ← T.numberOfKeys
T.size
if loadFactor > 0.9:
Create Tnew of size 2 × T.size
Choose hnew with cardinality Tnew .size
For each object in T:
Insert object in Tnew using hnew
T ← Tnew , h ← hnew
Rehash Running Time
You should call Rehash after each operation
with the hash table
Similarly to dynamic arrays, single rehashing
takes O(n) time, but amortized running time
of each operation with hash table is still O(1)
on average, because rehashing will be rare
Outline
1 Phone Book Data Structure
2 Universal Family
3 Hashing Phone Numbers
4 Hashing Names
5 Analysis of Polynomial Hashing
Hashing Phone Numbers
We can convert phone numbers to
integers, for example
148-25-67 → 1482567
Hashing Phone Numbers
We can convert phone numbers to
integers, for example
148-25-67 → 1482567
As a result, any phone number will be
converted to an integer less than 1015
Hashing Phone Numbers
We can convert phone numbers to
integers, for example
148-25-67 → 1482567
As a result, any phone number will be
converted to an integer less than 1015
If we come up with a universal family
for integers up to 1015, we will be able
to map phone numbers to names
efficiently using chaining
Hashing Integers
Lemma
{ }
Hp = ha,b
p (x) = ((ax + b) mod p) mod m
for all a, b : 1 ≤ a ≤ p − 1, 0 ≤ b ≤ p − 1 is
a universal family for the set of integers
between 0 and p − 1, for any prime p.
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = h34,2
p and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = h34,2
p and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = h34,2
p and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
407185 mod 1000 = 185
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = h34,2
p and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
407185 mod 1000 = 185
h(x) = 185
Proof Ideas
For any pair of different keys (x, y), any
two of the p(p − 1) hash functions in
Hp hash them into different pairs (r, s)
of different remainders modulo p
Proof Ideas
For any pair of different keys (x, y), any
two of the p(p − 1) hash functions in
Hp hash them into different pairs (r, s)
of different remainders modulo p
Thus any pair (r, s) of different
remainders modulo p has equal
1
probability p(p−1)
Proof Ideas
For any pair of different keys (x, y), any
two of the p(p − 1) hash functions in
Hp hash them into different pairs (r, s)
of different remainders modulo p
Thus any pair (r, s) of different
remainders modulo p has equal
1
probability p(p−1)
The ratio of pairs (r, s) of different
remainders modulo p such that r ≡ s
(mod m) is less than m1
General Case
Define maximum length L of a phone
number
General Case
Define maximum length L of a phone
number
Convert phone numbers to integers
from 0 to 10L − 1
General Case
Define maximum length L of a phone
number
Convert phone numbers to integers
from 0 to 10L − 1
Choose prime number p > 10L
General Case
Define maximum length L of a phone
number
Convert phone numbers to integers
from 0 to 10L − 1
Choose prime number p > 10L
Choose hash table size m
General Case
Define maximum length L of a phone
number
Convert phone numbers to integers
from 0 to 10L − 1
Choose prime number p > 10L
Choose hash table size m
Choose random hash function from
universal family Hp (choose random
a ∈ [1, p − 1] and b ∈ [0, p − 1])
Outline
1 Phone Book Data Structure
2 Universal Family
3 Hashing Phone Numbers
4 Hashing Names
5 Analysis of Polynomial Hashing
Lookup Phone Numbers by Name
Time to implement Map from names to
phone numbers
Lookup Phone Numbers by Name
Time to implement Map from names to
phone numbers
Use chaining again
Lookup Phone Numbers by Name
Time to implement Map from names to
phone numbers
Use chaining again
Need a hash function for names
Lookup Phone Numbers by Name
Time to implement Map from names to
phone numbers
Use chaining again
Need a hash function for names
Hash arbitrary strings of bounded length
Lookup Phone Numbers by Name
Time to implement Map from names to
phone numbers
Use chaining again
Need a hash function for names
Hash arbitrary strings of bounded length
You will learn how string hashing is
implemented in Java!
String Length Notation
Definition
Denote by |S| the length of string S.
Examples
|“edx”| = 3
|“ucsd”| = 4
|“chaining”| = 8
Hashing Strings
Given a string S, compute its hash value
Hashing Strings
Given a string S, compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
are individual characters
Hashing Strings
Given a string S, compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
are individual characters
We should use all the characters in the
hash function
Hashing Strings
Given a string S, compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
are individual characters
We should use all the characters in the
hash function
Otherwise there will be many collisions
Hashing Strings
Given a string S, compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
are individual characters
We should use all the characters in the
hash function
Otherwise there will be many collisions
For example, if S[0] is not used, then
h(“aa”) = h(“ba”) = · · · = h(“za”)
Preparation
Convert each character S[i] to integer
Preparation
Convert each character S[i] to integer
ASCII encoding, Unicode, etc.
Preparation
Convert each character S[i] to integer
ASCII encoding, Unicode, etc.
Choose big prime number p
Polynomial Hashing
Definition
Family of hash functions
|S|−1
∑
x i
Pp = hp(S) = S[i]x mod p
i=0
with a fixed prime p and all 1 ≤ x ≤ p − 1 is
called polynomial.
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
Example: |S| = 3
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
Example: |S| = 3
1 hash ← 0
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
Example: |S| = 3
1 hash ← 0
2 hash ← S[2] mod p
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
Example: |S| = 3
1 hash ← 0
2 hash ← S[2] mod p
3 hash ← S[1] + S[2]x mod p
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash · x + S[i]) mod p
return hash
Example: |S| = 3
1 hash ← 0
2 hash ← S[2] mod p
3 hash ← S[1] + S[2]x mod p
4 hash ← S[0] + S[1]x + S[2]x mod p
2
Java Implementation
The method hashCode of the built-in Java
class String is very similar to our
PolyHash, it just uses x = 31 and for
technical reasons avoids the (mod p)
operator.
Java Implementation
The method hashCode of the built-in Java
class String is very similar to our
PolyHash, it just uses x = 31 and for
technical reasons avoids the (mod p)
operator.
You now know the implementation of the
function that is used trillions of times a day
in many thousands of programs!
Outline
1 Phone Book Data Structure
2 Universal Family
3 Hashing Phone Numbers
4 Hashing Names
5 Analysis of Polynomial Hashing
Lemma
For any two different strings s1 and s2 of
length at most L + 1, if you choose h from
Pp at random (by selecting a random
x ∈ [1, p − 1]), the probability of collision
Pr[h(s1) = h(s2)] is at most Lp .
Lemma
For any two different strings s1 and s2 of
length at most L + 1, if you choose h from
Pp at random (by selecting a random
x ∈ [1, p − 1]), the probability of collision
Pr[h(s1) = h(s2)] is at most Lp .
Proof idea
This follows from the fact that the equation
a0 + a1x + a2x2 + · · · + aLxL = 0 (mod p) for
prime p has at most L different solutions x.
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
First, apply random hx from Pp, and then
hash the resulting value again using universal
family for integers. Denote the resulting
function by hm.
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
First, apply random hx from Pp, and then
hash the resulting value again using universal
family for integers. Denote the resulting
function by hm.
hm(S) = ha,b(hx(S)) mod m
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
First, apply random hx from Pp, and then
hash the resulting value again using universal
family for integers. Denote the resulting
function by hm.
hm(S) = ha,b(hx(S)) mod m
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
First, apply random hx from Pp, and then
hash the resulting value again using universal
family for integers. Denote the resulting
function by hm.
hm(S) = ha,b(hx(S)) mod m
Cardinality Fix
For use in a hash table of size m, we need a
hash function of cardinality m.
First, apply random hx from Pp, and then
hash the resulting value again using universal
family for integers. Denote the resulting
function by hm.
hm(S) = ha,b(hx(S)) mod m
Lemma
For any two different strings s1 and s2 of
length at most L + 1 and cardinality m, the
probability of collision Pr[hm(s1) = hm(s2)] is
at most m1 + Lp .
Polynomial Hashing
Corollary
If p > mL, for any two different strings s1
and s2 of length at most L + 1 the probability
of collision Pr[hm(s1) = hm(s2)] is O( m1 ).
Proof
1
m + Lp < 1
m
L
+ mL = 1
m + m1 = 2
m = O( m1 )
Running Time
For p > mL we have
c = O(1 + mn ) = O(1 + α) again
Running Time
For p > mL we have
c = O(1 + mn ) = O(1 + α) again
Computing PolyHash(S) runs in O(|S|)
Running Time
For p > mL we have
c = O(1 + mn ) = O(1 + α) again
Computing PolyHash(S) runs in O(|S|)
If lengths of the names in the phone
book are bounded by constant L,
computing h(S) takes O(L) = O(1)
time
Conclusion
You learned how to hash integers and
strings
Phone book can be implemented as two
hash tables
Mapping phone numbers to names and
back
Search and modification run in O(1) on
average!