Hashing
The sequential (linear) search algorithm takes time
proportional to the data size, i.e., O(n).
Binary search improves on linear search, reducing the
search time to O(log n).
With a BST, O(log n) search efficiency can be
obtained on average, but the worst-case complexity is O(n).
To guarantee O(log n) search time, the BST must be
height-balanced (e.g., an AVL tree).
Hashing
Suppose that we want to store 10,000 student records (each with
a 5-digit ID) in a given container.
·A linked list implementation would take O(n) access time.
·A height-balanced tree would give O(log n) access time.
·An array of size 100,000 would give O(1) access time but
would waste a lot of space.
Hashing
Is there some way that we could get O(1)
access without wasting a lot of space?
The answer is hashing.
Constant time per operation (on average).
As with an array, we come up with a function that maps the
large key range into one we can manage.
Basic Idea
Use a hash (or hashing) function to map a hash
key to a hash address (location) in a hash
table.
Hash function h: K -> L
If student A has ID (key) k and h is the hash
function, then A's details are stored at
position h(k) of the table.
To search for A, compute h(k) to locate the
position. If no element is there, the hash table does
not contain A.
Example
Let the keys be the IDs of 100 students, where each
ID looks like 345610.
We decide to use an array A[100].
And the hash function is, say, the LAST TWO DIGITS.
So 103062 goes to location 62; and if someone
has ID 113062, it also goes to
location 62.
THIS EVENT IS CALLED A COLLISION.
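The last-two-digits scheme above can be sketched as follows (a minimal illustration; the function and variable names are mine, not from the slides):

```python
def h(student_id):
    """Hash function: the last two digits of the ID."""
    return student_id % 100

# slot -> list of IDs that hashed there
table = {}
for sid in (103062, 113062):
    table.setdefault(h(sid), []).append(sid)

print(table)  # {62: [103062, 113062]} -- both IDs land in slot 62: a collision
```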
Collisions
[Figure: the universe of keys U contains the actual keys K = {k1, ..., k5}; h maps them into table slots 0..m-1. Here h(k2) = h(k5): a collision!]
Collisions
Two or more keys hash to the same location
For a given set K of keys
If |K| ≤ m, collisions may or may not happen, depending on the
hash function
If |K| > m, collisions will definitely happen (i.e., there must be at
least two keys that have the same hash value)
Avoiding collisions completely is hard, even with a good hash
function
Collision Resolution
Methods:
Separate Chaining (open hashing)
Open addressing (closed hashing)
Linear probing
Quadratic probing
Double hashing
We will discuss chaining first, and then ways to build "good"
hash functions.
Collision with Chaining - Discussion
Choosing the size of the table
Small enough not to waste space
Large enough such that lists remain short
Typically 1/5 or 1/10 of the total number of elements
How should we keep the lists: ordered or not?
Not ordered!
Insert is fast
Can easily remove the most recently inserted elements
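The chaining scheme above can be sketched as a small class (a minimal sketch; the class and method names are mine, not from the slides):

```python
class ChainedHashTable:
    """Separate chaining: each of the m slots holds an (unordered) list of keys."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m  # division-method hash

    def insert(self, key):
        # Prepend to the chain: conceptually O(1), and the most
        # recently inserted element sits at the front.
        self.slots[self._h(key)].insert(0, key)

    def search(self, key):
        # Time proportional to the length of the chain at h(key).
        return key in self.slots[self._h(key)]

    def delete(self, key):
        # Deletion first has to search the corresponding chain.
        chain = self.slots[self._h(key)]
        if key in chain:
            chain.remove(key)
```

For example, with m = 7 the keys 14 and 21 both hash to slot 0 and end up in the same chain.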
Insertion in Hash Tables
Worst-case running time is O(1) (insert at the head of the chain).
This assumes that the element being inserted isn't already in the list.
It would take an additional search to check whether it was already inserted.
Deletion in Hash Tables
Need to find the element to be deleted.
Worst-case running time: deletion depends on searching the corresponding list.
Searching in Hash Tables
Search for an element with key k in list T[h(k)].
The running time is proportional to the length of the list of
elements at location h(k).
Analysis of Hashing with Chaining: Worst Case
How long does it take to search for an element with a given key?
Worst case: all n keys hash to the same slot.
Worst-case search time is Θ(n), plus the time to compute the hash function.
[Figure: table T with slots 0..m-1; all keys sit in one long chain]
Load Factor of a Hash Table
Load factor of a hash table T:
α = n/m
n = # of elements stored in the table
m = # of locations in the table = # of linked lists
α is the average number of elements stored in a chain
α can be <, =, or > 1
[Figure: table T with slots 0..m-1, each slot holding a chain]
Hash Functions
A hash function transforms a hash key into a hash table address.
What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input, every output is equally likely (simple uniform hashing)
In practice, it is very hard to satisfy the simple uniform hashing property,
i.e., we don't know in advance the probability distribution from which keys are drawn.
Good Approaches for Hash Functions
Minimize the chance that closely related keys hash to the same slot
Strings such as pt and pts should hash to different slots
Derive a hash value that is independent of any patterns that may exist in the
distribution of the keys
Hash keys such as 199 and 499 should hash to different slots
The Division Method
Idea:
Map a key k into one of the m slots by taking the
remainder of k divided by m:
h(k) = k mod m
m is usually chosen to be a prime number, or at least a number without small
divisors, to minimize the number of collisions.
Advantage:
fast, requires only one operation
Disadvantage:
Certain values of m are bad, e.g.,
powers of 2
non-prime numbers
h(k) = k mod M
Here,
k is the key value, and
M is the size of the hash table.
k = 12345, M = 95: h(12345) = 12345 mod 95 = 90
k = 1276, M = 11: h(1276) = 1276 mod 11 = 0
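The division method is a one-liner; a minimal sketch reproducing the two examples above (the function name is mine):

```python
def h_div(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

print(h_div(12345, 95))  # 90
print(h_div(1276, 11))   # 0
```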
The Midsquare Method
Idea:
The key k is squared, and h(k) = l,
where l is obtained by deleting digits from both
ends of k².
The same positions of k² are used for all the keys.

k    | 3205     | 7148     | 2345
k²   | 10272025 | 51093904 | 5499025
h(k) | 72       | 93       | 99
The mid-square method is a very good hashing method. It involves two steps to compute
the hash value:
1. Square the value of the key k, i.e., k².
2. Extract the middle r digits as the hash value.
Suppose the hash table has 100 memory locations. Then r = 2, because two digits are required
to map a key to a memory location.
k = 60
k × k = 60 × 60 = 3600
h(60) = 60 (the middle two digits of 3600)
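The two steps above can be sketched as follows (a minimal sketch working on the decimal digits of k²; the function name is mine):

```python
def h_midsquare(k, r):
    """Mid-square method: square k, then extract the middle r digits of k**2."""
    s = str(k * k)
    start = (len(s) - r) // 2   # position of the middle r digits
    return int(s[start:start + r])

print(h_midsquare(3205, 2))  # 72  (3205**2 = 10272025)
print(h_midsquare(60, 2))    # 60  (60**2 = 3600)
```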
The Folding Method
Idea:
The key k is partitioned into a number of parts k1,
k2, ..., kr, where each part (except possibly the last)
has the same number of digits as the required address.
h(k) = k1 + k2 + ... + kr
where leading-digit carries are ignored, if any.
Sometimes the even-numbered parts are reversed.

k     | 3205    | 7148    | 2345
parts | 32 + 05 | 71 + 48 | 23 + 45
h(k)  | 37      | 19      | 68
This method involves two steps:
1. Divide the key value k into a number of parts k1, k2, k3, ..., kn, where each part has the same number of digits,
except for the last part, which may have fewer digits than the others.
2. Add the individual parts. The hash value is obtained by ignoring the final carry, if any.
Formula:
k = k1, k2, k3, ..., kn
s = k1 + k2 + k3 + ... + kn
h(k) = s
Here,
s is obtained by adding the parts of the key k.
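The two steps above can be sketched as follows (a minimal sketch for a 2-digit address space; dropping the carry is done with a final mod, and the function name is mine):

```python
def h_fold(k, digits):
    """Folding method: split k into `digits`-digit parts, sum them,
    and ignore any carry beyond `digits` digits."""
    s = str(k)
    parts = [int(s[i:i + digits]) for i in range(0, len(s), digits)]
    return sum(parts) % 10 ** digits  # drop the leading carry, if any

print(h_fold(3205, 2))  # 37  (32 + 05)
print(h_fold(7148, 2))  # 19  (71 + 48 = 119, carry dropped)
```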
Open Addressing
If we have enough contiguous memory to store all the keys (m > N), we can store the
keys in the table itself.
It is called "open" because the address where key k is stored
depends not only on h(k) but also on the keys already stored in the table.
It is also called closed hashing.
No need to use linked lists anymore.
(Hashing with chaining, by contrast, is called open hashing.)
Basic idea:
Insertion: if a slot is full, try another one,
until you find an empty one.
Search: follow the same sequence of probes.
Deletion: more difficult.
Search time depends on the length of the
probe sequence!
Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing
Linear probing: Inserting a key
Idea: when there is a collision, check the next available
position in the table (i.e., probing):
(h(k) + i) mod m, i = 0, 1, 2, ...
First slot probed: h(k)
Second slot probed: h(k) + 1
Third slot probed: h(k) + 2, and so on
Probe sequence: < h(k), h(k)+1, h(k)+2, ... >
The probe sequence wraps around to the beginning of
the table.
Linear Probing Example
Table size m = 7; h(k) = k mod 7.
insert(14): 14 % 7 = 0   insert(8): 8 % 7 = 1   insert(21): 21 % 7 = 0   insert(2): 2 % 7 = 2

slot | after 14 | after 8 | after 21 | after 2
0    | 14       | 14      | 14       | 14
1    |          | 8       | 8        | 8
2    |          |         | 21       | 21
3    |          |         |          | 2
4    |          |         |          |
5    |          |         |          |
6    |          |         |          |
probes: 1, 1, 3, 2
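The example above can be traced in code (a minimal sketch; the function name is mine, and None marks an empty slot):

```python
def linear_probe_insert(table, key):
    """Insert `key` using linear probing: try h(k), h(k)+1, h(k)+2, ...
    (wrapping around); return the number of probes used."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m   # h(k) = k mod m, then step by 1
        if table[slot] is None:
            table[slot] = key
            return i + 1
    raise RuntimeError("table is full")

table = [None] * 7
probes = [linear_probe_insert(table, k) for k in (14, 8, 21, 2)]
# probes == [1, 1, 3, 2]; the table ends up [14, 8, 21, 2, None, None, None]
```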
Quadratic probing: Inserting a key
Idea: when there is a collision, check positions at
quadratically increasing offsets:
(h(k) + i²) mod m, i = 0, 1, 2, ...
Probe sequence: < h(k), h(k)+1, h(k)+4, ... >
First slot probed: h(k)
Second slot probed: h(k) + 1
Third slot probed: h(k) + 4, and so on
The probe sequence wraps around to the beginning
of the table.
The clustering problem is less serious,
but it is still an issue (secondary clustering):
multiple keys hashed to the same spot all follow the same
probe sequence.
Quadratic Probing Example
Table size m = 7; h(k) = k mod 7.
insert(14): 14 % 7 = 0   insert(8): 8 % 7 = 1   insert(21): 21 % 7 = 0   insert(2): 2 % 7 = 2

slot | after 14 | after 8 | after 21 | after 2
0    | 14       | 14      | 14       | 14
1    |          | 8       | 8        | 8
2    |          |         |          | 2
3    |          |         |          |
4    |          |         | 21       | 21
5    |          |         |          |
6    |          |         |          |
probes: 1, 1, 3, 1
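The same trace with quadratic offsets (a minimal sketch; the function name is mine):

```python
def quadratic_probe_insert(table, key):
    """Insert `key` using quadratic probing: try h(k), h(k)+1, h(k)+4, ...
    (wrapping around); return the number of probes used."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i * i) % m   # offset grows as i**2
        if table[slot] is None:
            table[slot] = key
            return i + 1
    raise RuntimeError("no empty slot found on this probe sequence")

table = [None] * 7
probes = [quadratic_probe_insert(table, k) for k in (14, 8, 21, 2)]
# probes == [1, 1, 3, 1]; 21 lands in slot 4 = (0 + 2**2) mod 7
```

Note that quadratic probing may fail to find an empty slot even when one exists, which is why the sketch raises after m attempts.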
Double Hashing
(1) Use one hash function to determine the first slot.
(2) Use a second hash function to determine the increment for the
probe sequence:
(h1(k) + i·h2(k)) mod m, i = 0, 1, ...
Initial probe: h1(k)
The second probe is offset by h2(k) mod m, and so on.
Advantage: avoids clustering.
Disadvantage: harder to delete an element.
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i·h2(k)) mod 13
The table (m = 13) already holds: 79 in slot 1, 69 in slot 4, 98 in slot 5, 72 in slot 7, 50 in slot 11.
Insert key 14:
h1(14) = 14 mod 13 = 1 (occupied by 79)
h(14, 1) = (h1(14) + h2(14)) mod 13
         = (1 + 4) mod 13 = 5 (occupied by 98)
h(14, 2) = (h1(14) + 2·h2(14)) mod 13
         = (1 + 8) mod 13 = 9 (empty), so 14 goes to slot 9.
Double Hashing Example
Table size m = 7; h1(k) = k mod 7, h2(k) = 5 - (k mod 5).
insert(14): 14 % 7 = 0   insert(8): 8 % 7 = 1   insert(21): 21 % 7 = 0, 5 - (21 % 5) = 4   insert(2): 2 % 7 = 2   insert(7): 7 % 7 = 0, 5 - (7 % 5) = 3

slot | after 14 | after 8 | after 21 | after 2
0    | 14       | 14      | 14       | 14
1    |          | 8       | 8        | 8
2    |          |         |          | 2
3    |          |         |          |
4    |          |         | 21       | 21
5    |          |         |          |
6    |          |         |          |
probes: 1, 1, 2, 1, ??
Double Hashing Example
Table size m = 7; h1(k) = k mod 7, h2(k) = 5 - (k mod 5).
insert(14): 14 % 7 = 0   insert(8): 8 % 7 = 1   insert(21): 21 % 7 = 0, 5 - (21 % 5) = 4   insert(2): 2 % 7 = 2   insert(56): 56 % 7 = 0, 5 - (56 % 5) = 4

slot | after 14 | after 8 | after 21 | after 2 | after 56
0    | 14       | 14      | 14       | 14      | 14
1    |          | 8       | 8        | 8       | 8
2    |          |         |          | 2       | 2
3    |          |         |          |         |
4    |          |         | 21       | 21      | 21
5    |          |         |          |         | 56
6    |          |         |          |         |
probes: 1, 1, 2, 1, 4
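The trace above can be reproduced in code (a minimal sketch hard-coding the slides' h2(k) = 5 - (k mod 5); the function name is mine):

```python
def double_hash_insert(table, key):
    """Insert `key` using double hashing with h1(k) = k mod m and
    h2(k) = 5 - (k mod 5), probing (h1 + i*h2) mod m for i = 0, 1, ...
    Returns the number of probes used."""
    m = len(table)
    h1 = key % m
    h2 = 5 - key % 5   # always in 1..5, so never a zero step
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return i + 1
    raise RuntimeError("probe sequence exhausted")

table = [None] * 7
probes = [double_hash_insert(table, k) for k in (14, 8, 21, 2, 56)]
# probes == [1, 1, 2, 1, 4]; 56 ends up in slot (0 + 3*4) mod 7 = 5
```

Because m = 7 is prime and h2 is never 0, every probe sequence here visits all slots.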