Chapter 4
WHAT IS HASHING?
Hashing refers to the process of generating a fixed-size output from an input of variable size using mathematical functions known as hash functions. This technique determines an index or location for the storage of an item in a data structure.
Now the question arises: if arrays were already there, what was the need for a new data structure? The answer lies in the word "efficiency". Although storing an element in an array takes O(1) time, searching for one takes O(n) time in an unsorted array, and O(log n) even in a sorted one. This cost appears small, but for a large data set it adds up quickly, and this makes a plain array inefficient for search-heavy workloads.
Components of Hashing
There are three major components of hashing:
1. Key: A key can be anything, such as a string or an integer, that is fed as input to the hash function, the technique that determines an index or location for storage of an item in a data structure.
2. Hash Function: The hash function receives the input key and returns the index of an element in an array called a hash table. The index is known as the hash index.
3. Hash Table: A hash table is a data structure that maps keys to values using a special function called a hash function. A hash table stores the data in an associative manner in an array where each data value has its own unique index.
The value computed by applying the hash function to the key is often referred to as the hashed key. The entries in the array are scattered (not necessarily sequential), as can be seen in the figure below.
The cost of the insert, find, and delete operations is now only O(1) on average. Can you think of why? Hash tables are very good if you need to perform a lot of search operations on a relatively stable table (i.e., there are far fewer insertion and deletion operations than search operations). On the other hand, if traversals (covering the entire table), insertions, and deletions are much more frequent than simple search operations, then balanced binary search trees (such as AVL trees) are the preferred implementation choice.
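As a quick illustration, Python's built-in `dict` is itself a hash table, so the three core operations can be seen running in expected constant time:

```python
# Python's dict is a hash table: insert, find, and delete
# all run in expected O(1) time, independent of table size.
table = {}

table["alice"] = 25       # insert
table["bob"] = 31         # insert

age = table.get("alice")  # find
del table["bob"]          # delete

print(age)             # 25
print("bob" in table)  # False
```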
Hashing Performance
There are three factors that influence the performance of hashing:
o Hash function
o Collision resolution strategy (e.g., separate chaining: chain several keys/entries at the same position)
o Table size
o Too large a table will cause a wastage of memory.
o Too small a table will cause increased collisions and eventually force rehashing (creating a new hash table of larger size and copying the contents of the current hash table into it).
o The size should be appropriate to the hash function used and should typically be a prime number. Why? (We discussed this in class.)
Properties of a good hash function:
1. Efficiently computable.
2. Should uniformly distribute the keys (each table position equally likely for each key). For example: for phone numbers, a bad hash function is to take the first three digits; taking the last three digits is better. Please note that this may not be the best hash function; there may be better ways.
3. The number of collisions should be low while placing records in the hash table. Ideally no collision should occur; such a function is called a perfect hash function.
4. Should produce keys that get distributed uniformly over the array.
5. Should depend on every bit of the key. Thus, a hash function that simply extracts a portion of the key is not suitable.
Modular Arithmetic: Compute the index by dividing the key by some value and using the remainder as the index. This forms the basis of the next two techniques.
TEACHER GUIDE:
Example:
Let's find 17 mod 5.
In modular arithmetic, we are essentially looking for the remainder when dividing one number by another.
Conclusion:
The result of 17 mod 5 is 2.
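The division method above can be sketched in a couple of lines (the function name `division_hash` is illustrative, not from the text):

```python
def division_hash(key, table_size):
    """Division method: the remainder of key / table_size is the index."""
    return key % table_size

# The worked example: 17 mod 5
print(division_hash(17, 5))  # 2
```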
Truncation: Ignore part of the key and use the rest as the array index. The problem with this approach is that there may not always be an even distribution throughout the table. For example: if a student id such as 928324312 is the key, then select just the last three digits, i.e. 312, as the index. ⇒ the table size has to be at least 1000 (indices 0 to 999).
Key Concept:
Truncation uses only a fixed portion of the key's digits as the index and discards the rest. It is very fast to compute, but because the discarded digits carry information, any two keys that share the retained digits will collide.
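A minimal sketch of truncation (the helper name `truncation_hash` is made up for illustration); it keeps the last three digits of the key, as in the student-id example:

```python
def truncation_hash(key, digits=3):
    """Use only the last `digits` digits of the key as the index."""
    return key % (10 ** digits)

# Student id 928324312 -> last three digits 312
print(truncation_hash(928324312))  # 312
```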
o MID-SQUARE is a hashing technique in which unique keys are generated. In this technique, a seed value is taken and squared. Then, some digits from the middle are extracted. These extracted digits form a number which is taken as the new seed. This technique can generate keys with high randomness if a big enough seed value is taken. However, it has a limitation: as the seed is squared, if a 6-digit number is taken, then the square will have up to 12 digits. The mid-square method is also a classic technique in computer science for generating pseudo-random numbers: it involves squaring a number and then taking the middle digits of the result as the next number. The method can also be used in modular arithmetic.
TEACHER GUIDE:
How does the Mid-Square Method work?
1. Choose a number x (called the seed).
2. Square the number x.
3. Extract the middle digits from the result. The number of digits you extract depends on how large the result should be.
4. Use the middle digits as the next number.
Example: Using the Mid-Square Method
Let's go through an example to illustrate the process.
Step 1: Choose a seed
Let's start with a seed value of x = 45.
Step 2: Square the number
Now, square the number:
45² = 2025
Step 3: Extract the middle digits
For a 4-digit number like 2025, we'll extract the middle two digits. In this case, the middle digits are 02 (since we are working with 4 digits).
Step 4: Use the middle digits as the next value
The next number to use in the sequence is 02.
We can now repeat this process with the number 02 as the new seed.
Second iteration:
1. Square 02:
02² = 4
2. Extract the middle digits: Since the result is a single digit, we pad with leading zeros (0004) and take the middle digits, 00. (Small seeds degenerate quickly like this, which is a known weakness of the mid-square method.)
3. Use the result as the new seed and repeat the process.
Example:
Suppose the size of the hash table (m) = 10 (indices 0 to 9); the maximum number of digits required for the index is 1.
Element (x) = 12 ⇒ x² = 144
The middle digit of 144 is 4, so the element x = 12 will be stored at index = 4 in the hash table with 10 slots.
Another Example:
Suppose the size of the hash table (m) = 1000 (indices 0 to 999); the maximum number of digits required for the index is 3.
Element (x) = 87431 ⇒ x² = 7644179761
The possible 3-digit mids of 7644179761 are 417 or 179; we can pick either of those mids. If we pick 417, then the element x = 87431 will be stored at index = 417 in the hash table with 1000 slots.
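Both examples can be reproduced with a short sketch (the function name `mid_square_hash` is made up; note that converting to `int` drops leading zeros, so a middle of "02" becomes 2):

```python
def mid_square_hash(key, index_digits):
    """Square the key and take the middle `index_digits` digits as the index."""
    squared = str(key * key)
    start = (len(squared) - index_digits) // 2
    return int(squared[start:start + index_digits])

print(mid_square_hash(12, 1))     # 144 -> middle digit 4
print(mid_square_hash(87431, 3))  # 7644179761 -> middle digits 417
print(mid_square_hash(45, 2))     # 2025 -> middle digits "02" -> 2
```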
Folding: Partition the key into several pieces and then combine them in some convenient way. Folding breaks a key value up into segments that are added together to form a hash value; another variant applies a multiplicative hash function to each segment individually before adding. Some folding methods go one step further and reverse every other piece before the addition. This folding method is largely independent of the distribution of the keys.
Levi John A. Bernisto,
MIS
COMSC 3
TEACHERS GUIDE:
Example of Folding:
Let's say we want to generate a number using the folding method with the seed number 12345.
Step 1: Split the number into parts
First, we divide the number into two parts. For the number 12345, we could split it as follows:
First part: 123
Second part: 45
Step 2: Add the parts together
Next, we sum the two parts:
123 + 45 = 168
Step 3: Apply modulo (optional)
If we want to limit the result within a certain range, we could apply modular arithmetic. For example, if we want the result to be between 0 and 9, we could apply mod 10:
168 mod 10 = 8
So, the result of folding 12345 would be 8.
Folding in Random Number Generation:
Folding is often used in random number generation methods where the result is reduced
in size, or it's used to combine two separate random numbers to ensure a more diverse
range of possible outputs. It's especially useful in methods where you are trying to
introduce randomness from various sources, or where you have a long number and
want to generate a new number by folding its parts.
Pros of Folding:
Simple to implement.
Helps combine multiple pieces of information (like splitting a large number into smaller
segments).
Can improve the uniformity of random numbers by mixing parts of the number.
Example with More Complex Folding:
Let’s use the number 987654321 for a more complex folding example.
1. Split into parts:
o First part: 9876
o Second part: 54321
2. Sum the parts:
9876 + 54321 = 64197
3. Apply modulo (for a range of 0-9):
64197 mod 10 = 7
So, in this case, the folded result of 987654321 would be 7.
Conclusion:
Folding is a method of breaking a number into parts and combining those parts in some
way (like adding them together) to produce a result.
It is often used in random number generation and modular arithmetic for generating
results within a certain range.
It’s a simple and effective method to reduce or combine numbers, but it can introduce
cycles or patterns if not carefully designed.
Algorithm:
The folding method for creating hash functions starts with the item being divided into equal-sized pieces (the last piece may be smaller than the others).
The outcome of adding these pieces together is the hash value: H(x) = (a + b + c) mod M, where a, b, and c represent the preconditioned key broken down into three parts, M is the table size, and mod stands for modulo.
In other words, the sum of the parts of the preconditioned key is divided by the table size. The remainder is the hash key.
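The H(x) = (a + b + … ) mod M scheme can be sketched as follows (the function name `fold_hash` is illustrative; pieces are taken left to right, so the last piece may be shorter, as the algorithm allows):

```python
def fold_hash(key, piece_len, table_size):
    """Shift folding: split the key's digits into pieces of `piece_len`
    digits, add the pieces, and take the sum mod table_size."""
    digits = str(key)
    pieces = [int(digits[i:i + piece_len])
              for i in range(0, len(digits), piece_len)]
    return sum(pieces) % table_size

# The teacher-guide example: 123 + 45 = 168, and 168 mod 10 = 8
print(fold_hash(12345, 3, 10))  # 8
```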
Explanation:
In hashing algorithms (like those used for checking data integrity or in cryptographic applications), a collision occurs when two different inputs (data, files, or numbers) result in the same hash value. Since hash functions reduce data to a fixed-size output (hash), there's a possibility that two different inputs will map to the same output.
Let's say we have a hash function H(x) that generates a 3-digit hash. For simplicity:
H("apple") = 123
H("banana") = 123
In this case, both "apple" and "banana" map to the same hash value 123, which is a collision. This means that the hash function isn't perfectly distinguishing between these two different inputs.
Conclusion:
A collision in modular arithmetic occurs when two different numbers give the same remainder
after applying the modulus.
In hashing, a collision happens when two different inputs produce the same hash value.
Collisions are generally undesirable, as they reduce the ability to distinguish between different
inputs and can cause security or data integrity problems.
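A collision is easy to reproduce: any two keys with the same remainder modulo the table size collide. A tiny sketch:

```python
def h(key, table_size=10):
    """A simple division-method hash."""
    return key % table_size

# 12 and 22 are different keys, but h maps both to index 2: a collision.
print(h(12))           # 2
print(h(22))           # 2
print(h(12) == h(22))  # True
```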
Collision Resolution
1. Open Addressing:
TEACHERS GUIDE:
Open Addressing is a method used in hash tables to resolve collisions—which occur when
two different keys hash to the same index. Instead of storing multiple keys at the same
index (like in chaining, where a linked list is used to store multiple values), open
addressing resolves collisions by finding another open slot within the hash table.
How Open Addressing Works:
In open addressing, when a collision occurs (i.e., two keys hash to the same index), the
hash table searches for the next available slot using a probing technique. There are
different types of probing strategies, but the key idea is that you probe the table in a
systematic way to find an open slot.
Key Steps in Open Addressing:
1. Hash the key to find an initial index.
2. If the slot at that index is empty, insert the key there; the insertion is successful.
3. If the slot is occupied by a different key (a collision), a probing strategy is used to find the next available slot.
The probing techniques define how the next index is calculated when a collision occurs. There are several common types of probing:
2. Quadratic Probing:
Quadratic probing uses a quadratic function to calculate the next index. Instead of linearly searching for the next available slot, quadratic probing increases the distance between probes in a quadratic manner.
o Formula for Quadratic Probing:
Next Index = (Current Index + i²) mod Table Size
where i is the number of attempts (starting with i = 1).
o Example: For a table of size 7 and a key 50, we hash 50 mod 7 = 1. If the slot at index 1 is taken, the next index checked will be:
(Index + 1²) mod 7 = (1 + 1) mod 7 = 2
If index 2 is also taken, we check:
(Index + 2²) mod 7 = (1 + 4) mod 7 = 5
This probing pattern continues quadratically.
3. Double Hashing:
Double hashing uses a second hash function to calculate the step size for probing, making it more complex but potentially more efficient in reducing clustering.
o Formula for Double Hashing:
Next Index = (Current Index + i · h2(key)) mod Table Size
where h2(key) is a second hash function, and i is the number of probes (attempts).
o Example: If the primary hash function gives an index of 3, and the second hash function for the key gives h2(key) = 2, we calculate the next index using the formula:
Next Index = (3 + i · 2) mod Table Size
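The double-hashing probe sequence can be sketched as follows. This is a hedged sketch: the table size of 7 and a key whose primary hash is 3 are assumptions for illustration, since the text does not fix them.

```python
def double_hash_probes(key, table_size, h2_value, attempts):
    """Probe sequence for double hashing:
    index_i = (h1(key) + i * h2(key)) mod table_size."""
    start = key % table_size  # primary hash h1
    return [(start + i * h2_value) % table_size for i in range(attempts)]

# Assumed setup: table size 7, key 3 (so the initial index is 3), h2(key) = 2
print(double_hash_probes(3, 7, 2, 4))  # [3, 5, 0, 2]
```

Notice that the step size 2 depends on the key, so two keys that collide at index 3 need not follow the same probe path.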
Symbol tables: Linear probing is commonly used in symbol tables, which are used in compilers and interpreters to store variables and their associated values. Since symbol tables can grow dynamically, linear probing can be used to accommodate new entries as the table fills.
Inserting Key 8:
Hashing: 8 mod 5 = 3
Insert key 8 at index 3.
Notice how the cluster at the beginning of the table (indices 0 to 4) is growing. Whenever
there’s a collision at index 3, the probe sequence will expand the cluster of filled slots.
The issue is that the longer this sequence continues, the more the cluster grows. As a result:
1. Inserting new keys becomes slower because the probe sequence will likely have to
traverse through many occupied slots.
2. Searching for an element can also become slower because a larger cluster of slots must
be checked.
Conclusion:
Primary Clustering is a common problem in open addressing (especially with linear
probing) that occurs when collisions cause consecutive slots to be filled, forming a large
cluster. This leads to slower insertion and search operations as the cluster grows. To mitigate
primary clustering, techniques like quadratic probing, double hashing, and rehashing are
commonly used to reduce the likelihood of consecutive slots being filled and improve overall
hash table performance.
2. Primary Clustering refers to the situation where these clusters of occupied slots keep
expanding in the table. If a new element hashes to an index that falls within or near an
existing cluster, it will be forced into this growing cluster, further increasing the number of
probes needed.
In essence, primary clustering happens because linear probing is not random enough to
avoid these clusters. The more collisions occur, the larger the cluster becomes, and the
slower the search and insertion operations will be.
TEACHERS GUIDE:
Example of Primary Clustering:
Consider a hash table of size 5 (indices 0 to 4), and let’s use linear probing to resolve
collisions. We'll insert the keys 8, 13, 18, and 23 into the hash table, and demonstrate how
primary clustering can develop.
Table Size: 5
Initial Table:
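The example above can be simulated directly. In this sketch (the helper name `linear_probe_insert` is made up), all four keys hash to index 3, so each insertion needs one more probe than the last as the cluster grows:

```python
def linear_probe_insert(table, key):
    """Insert `key` with linear probing; returns the index used and
    the number of slots examined."""
    size = len(table)
    index = key % size
    probes = 1
    while table[index] is not None:  # collision: step to the next slot
        index = (index + 1) % size
        probes += 1
    table[index] = key
    return index, probes

table = [None] * 5
for key in (8, 13, 18, 23):
    index, probes = linear_probe_insert(table, key)
    print(f"key {key} -> index {index} ({probes} probe(s))")
print(table)  # [18, 23, None, 8, 13]: one cluster spanning indices 3,4,0,1
```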
TEACHERS GUIDE:
Secondary Clustering is another issue that arises in open addressing when probing for available slots in a hash table. Unlike primary clustering, which occurs specifically with linear probing, secondary clustering occurs with quadratic probing. It refers to the phenomenon where, despite using a non-linear probing strategy, keys that hash to the same initial index still follow exactly the same probe sequence, causing the formation of (smaller) clusters along that sequence.
In essence, while primary clustering involves large contiguous clusters due to linear sequential probing, secondary clustering involves keys following an identical probe pattern, because the probe step depends only on the initial index. Double hashing mitigates secondary clustering, since its step size also depends on the key itself through the second hash function.
Another Example: Let us consider a simple hash function "key mod 5" and a sequence of keys to be inserted: 50, 70, 76, 93.
Step 1: First draw the empty hash table, which will have a possible range of hash values from 0 to 4 according to the hash function provided.
Hash table
Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It maps to slot number 0 because 50 % 5 = 0, so insert it into slot number 0.
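The remaining insertions can be traced with a minimal linear-probing sketch, using the same table size and hash function as the example:

```python
# Linear-probing trace: table size 5, h(k) = k % 5.
size = 5
table = [None] * size
for key in (50, 70, 76, 93):
    index = key % size
    while table[index] is not None:  # collision: step to the next slot
        index = (index + 1) % size
    table[index] = key
print(table)  # [50, 70, 76, 93, None]
```

Here 70 also hashes to 0 and is bumped to slot 1, which in turn bumps 76 (hash 1) to slot 2; only 93 lands directly at its home slot 3.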
2. Quadratic Probing
If you observe carefully, you will see that the interval between probes increases with each attempt. Quadratic probing is a method with the help of which we can solve the problem of primary clustering discussed above. In this method, we look for the i²-th slot in the i-th iteration, always starting from the original hash location; if that location is occupied, we check the subsequent slots in the quadratic sequence.
let hash(x) be the slot index computed using the hash function, and S the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Example: Let us consider table size = 7, hash function Hash(x) = x % 7, and collision resolution strategy f(i) = i². Insert 22, 30, and 50.
Step 1: Create a table of size 7.
Hash table
Step 2 – Insert 22 and 30
Hash (22) = 22 % 7 = 1, Since the cell at index 1 is empty, we can easily insert
22 at slot 1.
Hash (30) = 30 % 7 = 2, Since the cell at index 2 is empty, we can easily insert
30 at slot 2.
Step 3 – Insert 50
Hash (50) = 50 % 7 = 1. In our hash table, slot 1 is already occupied. So we search for slot 1 + i², i.e. 1 + 1² = 2. Slot 2 is also occupied, so we search for slot 1 + 2² = 1 + 4 = 5. Cell 5 is not occupied, so we place 50 in slot 5.
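The three insertions above can be traced in a short sketch, with the table size and hash function taken from the example:

```python
# Quadratic-probing trace: table size 7, h(x) = x % 7,
# probe sequence (h(x) + i*i) % 7 for i = 0, 1, 2, ...
size = 7
table = [None] * size
for key in (22, 30, 50):
    start = key % size
    i = 0
    while table[(start + i * i) % size] is not None:
        i += 1
    table[(start + i * i) % size] = key
print(table)  # 22 at 1, 30 at 2, and 50 lands at 5 after probing 1 and 2
```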
For example: Let us consider a simple hash function as “key mod 7” and
sequence of keys as 50, 700, 76, 85, 92, 73, 101
Closed Addressing:
Chaining: Store colliding keys in a linked list at each index.
The idea behind separate chaining is to implement each slot of the array as a linked list called a chain. Separate chaining is one of the most popular and commonly used techniques for handling collisions.
The linked list data structure is used to implement this technique. When multiple elements are hashed to the same slot index, these elements are inserted into a singly linked list known as a chain: all the elements that hash to the same slot index go into that list. Now, we can search for a key K in the linked list by linearly traversing it. If the key of any entry is equal to K, we have found our entry; if we reach the end of the linked list without finding it, the entry does not exist. Hence, in separate chaining, if two different elements have the same hash value, we store both elements in the same linked list one after the other.
Example: Let us consider a simple hash function as “key mod 7” and a sequence
of keys as 50, 700, 76, 85, 92, 73, 101
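The chaining example can be sketched as follows (plain Python lists stand in for the linked lists; the helper names are made up for illustration):

```python
# Separate chaining: each slot holds a list (the "chain") of all keys
# that hash to that index. h(key) = key % 7, as in the example.
size = 7
table = [[] for _ in range(size)]

def chain_insert(key):
    table[key % size].append(key)

def chain_search(key):
    return key in table[key % size]  # linear scan of one chain only

for key in (50, 700, 76, 85, 92, 73, 101):
    chain_insert(key)

print(table)
print(chain_search(92))  # True
print(chain_search(31))  # False
```

Note how 50, 85, and 92 all hash to index 1 and simply pile up in that slot's chain, while the rest of the table stays untouched.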
Advantages:
Simple to implement.
The hash table never fills up; we can always add more elements to a chain.
Less sensitive to the hash function or load factor.
Mostly used when it is unknown how many and how frequently keys may be inserted or deleted.
Disadvantages:
The cache performance of chaining is not good, as keys are stored in linked lists; open addressing provides better cache performance, since everything is stored in the same table.
Wastage of space (some slots of the hash table are never used).
If a chain becomes long, search time can become O(n) in the worst case.
Uses extra space for links.
Performance of Chaining:
The performance of hashing can be evaluated under the assumption that each key is equally likely to be hashed to any slot of the table (simple uniform hashing). Under this assumption, with n keys in a table of m slots, the expected cost of a search is O(1 + α), where α = n/m is the load factor.
Multiple-choice: We give a key two choices, h1(key) and h2(key), for residing.
Relocation: It may happen that both h1(key) and h2(key) are occupied. This is resolved by imitating the cuckoo bird: it pushes the other eggs or young out of the nest when it hatches. Analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location. This leaves us with the problem of placing the older key, which moves to its alternate position in the other table. If that position is vacant, we are done; otherwise, the older key displaces yet another key. This continues until the procedure finds a vacant position, or enters a cycle. In the case of a cycle, new hash functions are chosen and the whole data structure is 'rehashed'. Multiple rehashes might be necessary before the insertion succeeds.
Insertion is expected O(1) (amortized) with high probability, even considering the possibility of rehashing, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.
Illustration
Input:
Hash Functions:
h1(key) = key%11
Let’s start by inserting 20 at its possible position in the first table determined by
h1(20):
Next: 50
Next: 67. h1(67) = 1. But 100 is already there at 1. We place 67 in table 1 & 100
in table 2
Next: 105. h1(105) = 6. But 50 is already there at 6. We place 105 in table 1 &
50 in table 2 at h2(50) = 4. Now 53 has been displaced. h1(53) = 9. 75 displaced:
h2(75) = 6.
Next: 3. h1(3) = 3.
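The walkthrough above can be sketched in code. Two caveats: the input list is not shown in the text, so the keys below are reconstructed from the narrative, and the text only defines h1, so h2(key) = (key // 11) % 11 is an assumed second hash function (it reproduces the displacements mentioned, e.g. h2(50) = 4 and h2(75) = 6):

```python
# Cuckoo hashing sketch with two tables of size 11.
# h1(key) = key % 11 matches the text; h2 is an assumption (see above).
SIZE = 11

def h1(key): return key % SIZE
def h2(key): return (key // SIZE) % SIZE

table1 = [None] * SIZE
table2 = [None] * SIZE

def cuckoo_insert(key, max_kicks=2 * SIZE):
    """Insert `key`; displaced keys hop to their slot in the other table.
    Returns False if no vacancy is found (a real table would then rehash)."""
    for _ in range(max_kicks):
        pos = h1(key)
        if table1[pos] is None:
            table1[pos] = key
            return True
        key, table1[pos] = table1[pos], key   # kick out table 1's occupant
        pos = h2(key)
        if table2[pos] is None:
            table2[pos] = key
            return True
        key, table2[pos] = table2[pos], key   # kick out table 2's occupant
    return False  # likely a cycle: choose new hash functions and rehash

def cuckoo_lookup(key):
    """A key can only live in one of its two slots, so lookup is O(1)."""
    return table1[h1(key)] == key or table2[h2(key)] == key

for key in (20, 50, 53, 75, 100, 67, 105, 3):
    cuckoo_insert(key)

print(cuckoo_lookup(67))   # True
print(cuckoo_lookup(999))  # False
```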