Chapter 4

Hashing is a process that converts variable-sized input into a fixed-size output using hash functions, which helps in efficiently storing and accessing large amounts of data. It comprises three main components: keys, hash functions, and hash tables, with performance influenced by factors such as table size and collision resolution strategies. Good hash functions should be efficient, uniformly distribute keys, and minimize collisions, with techniques like modular arithmetic, truncation, mid-square, and folding being employed to generate hash values.

CHAPTER IV: HASHING

WHAT IS HASHING?
Hashing refers to the process of generating a fixed-size output from an input of
variable size using mathematical formulas known as hash functions. This
technique determines an index or location for the storage of an item in a data
structure.

Hashing is the process of scrambling a piece of information or data beyond
recognition. Hash functions are designed to be irreversible. We pass the input
through a hash function to calculate the hash value.

Need for Hash data structure


Every day, the data on the internet is increasing multifold and it is always a
struggle to store this data efficiently. In day-to-day programming, this amount
of data might not be that big, but still, it needs to be stored, accessed, and
processed easily and efficiently. A very common data structure that is used for
such a purpose is the Array data structure.

Now the question arises: if the Array was already there, what was the need for a
new data structure? The answer lies in the word “efficiency”. Though storing in
an Array takes O(1) time, searching in it takes at least O(log n) time when the
array is sorted, and O(n) time when it is not. This time appears to be small, but
for a large data set it can cause a lot of problems, and this, in turn, makes the
Array data structure inefficient.
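The efficiency difference can be seen directly in Python (the variable names here are just for illustration): a list models array-style storage, while a set models hash-based storage.

```python
# Contrast array (list) search with hash-based (set) lookup.
# A list must be scanned element by element: O(n) per search.
# A set hashes the key straight to its bucket: O(1) on average.

data = list(range(100_000))
as_list = data            # array-style storage
as_set = set(data)        # hash-based storage

target = 99_999
print(target in as_list)  # linear scan over up to 100,000 elements -> True
print(target in as_set)   # single hash computation, on average -> True
```

Both lookups return the same answer; the difference is how many elements each one has to inspect along the way.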

Components of Hashing
There are majorly three components of hashing:

1. Key: A key can be any value, such as a string or an integer, that is fed as
input to the hash function, the technique that determines an index or
location for storage of an item in a data structure.
2. Hash Function: The hash function receives the input key and returns the
index of an element in an array called a hash table. The index is known as
the hash index.

3. Hash Table: A hash table is a data structure that maps keys to values using
a special function called a hash function. It stores the data in an
associative manner in an array where each data value has its own unique
index.

Hashing using Arrays


When implementing a hash table using arrays, the nodes are not stored
consecutively, instead the location of storage is computed using the key and a
hash function. The computation of the array index can be visualized as shown
below:

The value computed by applying the hash function to the key is often referred to
as the hashed key. The entries into the array, are scattered (not necessarily
sequential) as can be seen in figure below.

The cost of the insert, find and delete operations is now only O(1). Can you think
of why? Hash tables are very good if you need to perform a lot of search
operations on a relatively stable table (i.e. there are a lot fewer insertion and
deletion operations than search operations). On the other hand, if traversals
(covering the entire table), insertions, and deletions are a lot more frequent than
simple search operations, then self-balancing ordered binary search trees (such
as AVL trees) are the preferred implementation choice.

Hashing Performance
There are three factors that influence the performance of hashing:
 Hash function
o should distribute the keys and entries evenly throughout the entire
table
o should minimize collisions
 Table size
o Too large a table will cause a wastage of memory
o Too small a table will cause increased collisions and eventually
force rehashing (creating a new hash table of larger size and copying
the contents of the current hash table into it)
o The size should be appropriate to the hash function used and
should typically be a prime number. Why? (We discussed this in
class.)
 Collision resolution strategy
o Separate Chaining: chain several keys/entries in the same position
o Open Addressing: store the key/entry in a different position

CHARACTERISTICS OF A GOOD HASH FUNCTION


A good hash function should have the following properties:

1. Efficiently computable.
2. Should uniformly distribute the keys (Each table position equally likely for
each key)

For example: for phone numbers, a bad hash function would take the first three
digits; a better function considers the last three digits. Please note that this
may not be the best hash function; there may be better ways.

Rules for choosing a good hash function:


1. The hash function should be simple to compute.

2. The number of collisions should be small when placing records in the hash
table. Ideally no collision should occur. Such a function is called a perfect hash
function.

3. The hash function should produce indices that are distributed uniformly
over the array.

4. The hash function should depend on every bit of the key. Thus, a hash
function that simply extracts a portion of the key is not suitable.

Selecting Hash Functions


The hash function converts the key into the table position. It can be carried out
using one of the following techniques:

 Modular Arithmetic: Compute the index by dividing the key by some
value and using the remainder as the index. This forms the basis of the next
two techniques.

For Example: index := key MOD table_size
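As a sketch, the modular-arithmetic technique above can be written as a one-line Python function (the name `mod_hash` is illustrative):

```python
def mod_hash(key: int, table_size: int) -> int:
    """Compute a hash index as the remainder of key divided by table_size."""
    return key % table_size

print(mod_hash(17, 5))          # 17 mod 5 = 2
print(mod_hash(123456789, 10))  # last decimal digit: 9
```

Any integer key is mapped into the range 0 .. table_size - 1, which is exactly what an array index requires.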

TEACHER GUIDE:

Example:
Let's find 17 mod 5.
In modular arithmetic, we are essentially looking for the remainder when dividing one number by
another.

Step 1: Perform the division


Divide 17 by 5 and note the quotient and the remainder.
17 ÷ 5 = 3 with a remainder of 2

Step 2: Identify the remainder


The division tells us that 17 divided by 5 gives a quotient of 3 (we ignore the decimal part), and the
remainder is 2.
17 = 5 × 3 + 2

Step 3: Write the result in modular form


Since the remainder is 2, we say that:
17 ≡ 2 (mod 5)
This means that 17 is congruent to 2 modulo 5, or equivalently, 17 leaves a remainder of 2
when divided by 5.

Conclusion:
The result of 17 mod 5 is 2.
 Truncation: Ignoring part of the key and using the rest as the array index. The
problem with this approach is that there may not always be an even distribution
throughout the table. For example: if student IDs are the key, e.g. 928324312, then select
just the last three digits as the index, i.e. 312. => the table size has to be at
least 1000 (indices 0 to 999).
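A minimal sketch of the truncation example in Python (`truncation_hash` is an illustrative name): keeping the last three decimal digits of a key is the same as computing key mod 1000.

```python
def truncation_hash(key: int, digits: int = 3) -> int:
    """Use the last `digits` decimal digits of the key as the index."""
    return key % (10 ** digits)

print(truncation_hash(928324312))  # last three digits -> 312
```

Note that this function ignores everything except the trailing digits, which is exactly why truncation can distribute keys unevenly when those digits are not uniform.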

Truncation in Modular Arithmetic


In modular arithmetic, truncation refers to how we handle the division process,
especially when dividing numbers to find remainders. Instead of working with
floating-point results, we truncate the quotient to its integer part.
Let's look at an example with truncation:

Example: Find 23 mod 7

Step 1: Perform the division


First, divide 23 by 7:
23 ÷ 7 = 3.2857…

Step 2: Truncate the quotient


In truncation, we discard the decimal part of the division result. The integer part of
3.2857 is 3.
So, we now have:
Quotient = 3

Step 3: Multiply the quotient by the divisor


Now, multiply the quotient (3) by the divisor (7):
3 × 7 = 21

Step 4: Subtract the product from the original number


Next, subtract the product (21) from the original number (23) to find the remainder:
23 − 21 = 2

Step 5: Express the result in modular form


The remainder is 2, so:
23 ≡ 2 (mod 7)

Conclusion:
The result of 23 mod 7 is 2.

Key Concept:
 Truncation ensures that we work with only the integer quotient when dividing the
number. This process is important for finding the remainder in modular arithmetic.
o MID-SQUARE: a hashing technique in which keys are generated by
squaring a seed value and extracting some digits from the middle of
the result. These extracted digits form a number which is taken as
the new seed. This technique can generate keys with high randomness
if a big enough seed value is taken. However, it has a limitation: as
the seed is squared, a 6-digit seed produces a square of up to 12 digits.
The method involves squaring a number and then taking the middle digits of the
result as the next number. The method can also be used in modular arithmetic.

TEACHER GUIDE:
How does the Mid-Square Method work?
1. Choose a number x (called the seed).
2. Square the number x.
3. Extract the middle digits from the result. The number of digits you extract depends on
how large the result should be.
4. Use the middle digits as the next number.
Example: Using the Mid-Square Method
Let's go through an example to illustrate the process.
Step 1: Choose a seed
Let’s start with a seed value of x = 45.
Step 2: Square the number
Now, square the number:
45² = 2025
Step 3: Extract the middle digits
For a 4-digit number like 2025, we’ll extract the middle two digits. In this case, the middle
digits are 02 (since we are working with 4 digits).
Step 4: Use the middle digits as the next value
The next number to use in the sequence is 02.
We can now repeat this process with the number 02 as the new seed.
Second iteration:
1. Square 02:
02² = 4
2. Extract the middle digits: Since the result is a single digit, we can add leading zeros if
necessary. For 4, the middle digits are 04.
3. Use 04 as the new seed and repeat the process.

Pros and Cons of the Mid-Square Method


Pros:
 It's easy to implement.
 It generates numbers based on squaring and digit extraction, which can sometimes give
good pseudo-random sequences for small applications.
Cons:
 It has a tendency to fall into repetitive cycles, especially if the seed is not carefully chosen.
 It’s not suitable for cryptographic or high-quality random number generation, as the
numbers can quickly become predictable.

A Quick Example with Modulo


The mid-square method can also be combined with modular arithmetic to further
constrain the number range. For example, if you want to keep the results within a specific
range, you could use modular arithmetic after squaring the number to ensure the output is
within limits.
For instance, let's say after squaring, we want the result to be between 0 and 9. We could
use:
Result = (x²) mod 10
This will give us a value between 0 and 9.

Example:
Suppose the size of the Hash Table (m) = 10 (0 - 9); the maximum number of
digits required for the index is 1.
Element (x) = 12 ⇒ x² = 144
The mid 1 digit of 144 is 4, so the element x = 12 will be stored at
index = 4 in the hash table with the size of 10 slots.
Another Example:
Suppose the size of the Hash Table (m) = 1000 (0 - 999); the maximum number of
digits required for the index is 3.

Element (x) = 87431 ⇒ x² = 7644179761

The possible 3-digit mids of 7644179761 are 417 or 179; we can pick
either of those mids. If we pick 417 then the element x = 87431 will be
stored at index = 417 in the hash table with the size of 1000
slots.
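The mid-square examples above can be sketched in Python. Exactly which digits count as the "middle" is a convention; this illustrative `mid_square_hash` takes the slice just left of centre, which happens to reproduce the indices 4 and 417 from the examples.

```python
def mid_square_hash(key: int, table_size: int) -> int:
    """Square the key and take the middle digits of the square as the index.

    The number of digits extracted matches the number of digits needed
    to index a table of `table_size` slots (assumes the square is at
    least that many digits long).
    """
    digits_needed = len(str(table_size - 1))
    square = str(key * key)
    mid = (len(square) - digits_needed) // 2  # left-of-centre slice
    return int(square[mid:mid + digits_needed])

print(mid_square_hash(12, 10))       # 12^2 = 144, middle digit -> 4
print(mid_square_hash(87431, 1000))  # 87431^2 = 7644179761, middle three -> 417
```

Choosing the other valid slice (179 in the second example) would be an equally correct convention; what matters is applying the same rule to every key.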

 Folding: Partition the key into several pieces and then combine them in some
convenient way. It breaks a key value into segments that are added to form a
hash value; another variation applies a multiplicative hash function to each
segment individually before adding. Some folding methods go one step further
and reverse every other piece before the addition. The folding method is
independent of the key distribution.
Levi John A. Bernisto,
MIS
COMSC 3

TEACHERS GUIDE:

Example of Folding:
Let's say we want to generate a random number using the folding method with the seed
number 12345.
Step 1: Split the number into parts
First, we divide the number into two parts. For the number 12345, we could split it as
follows:
 First part: 123
 Second part: 45
Step 2: Add the parts together
Next, we sum the two parts:
123 + 45 = 168
Step 3: Apply modulo (optional)
If we want to limit the result within a certain range, we could apply modular arithmetic.
For example, if we want the result to be between 0 and 9, we could apply mod 10:
168 mod 10 = 8
So, the result of folding 12345 would be 8.
Folding in Random Number Generation:
Folding is often used in random number generation methods where the result is reduced
in size, or it's used to combine two separate random numbers to ensure a more diverse
range of possible outputs. It's especially useful in methods where you are trying to
introduce randomness from various sources, or where you have a long number and
want to generate a new number by folding its parts.
Pros of Folding:
 Simple to implement.
 Helps combine multiple pieces of information (like splitting a large number into smaller
segments).
 Can improve the uniformity of random numbers by mixing parts of the number.
Example with More Complex Folding:
Let’s use the number 987654321 for a more complex folding example.
1. Split into parts:
o First part: 9876
o Second part: 54321
2. Sum the parts:
9876 + 54321 = 64197
3. Apply modulo (for a range of 0-9):
64197 mod 10 = 7
So, in this case, the folded result of 987654321 would be 7.

Conclusion:
 Folding is a method of breaking a number into parts and combining those parts in some
way (like adding them together) to produce a result.
 It is often used in random number generation and modular arithmetic for generating
results within a certain range.
 It’s a simple and effective method to reduce or combine numbers, but it can introduce
cycles or patterns if not carefully designed.

Algorithm:
The folding method for creating hash functions starts with the item being
divided into equal-sized pieces (the last piece may not be of equal size).
The outcome of adding these pieces together is the hash value, H(x) = (a + b + c)
mod M, where a, b, and c represent the preconditioned key broken down into
three parts, M is the table size, and mod stands for modulo.
In other words, the sum of three parts of the preconditioned key is divided by
the table size. The remainder is the hash key.
Explanation:

Example 1: The task is to fold the key 123456789 into a Hash


Table of ten spaces (0 through 9).
 It is given that the key, say X is 123456789 and the table size (i.e., M =
10).
 Since we can break X into three parts in any order, let’s divide it evenly.
 Therefore, a = 123, b = 456, c = 789.
 Now, H(x) = (a + b + c) mod M i.e., H(123456789) =(123 + 456 + 789) mod
10 = 1368 mod 10 = 8.
 Hence, 123456789 is inserted into the table at address 8.

Example 2: The task is to fold the key 452378912 into a Hash


Table of ten spaces (0 through 9).
 It is given that the key, say X is 452378912 and the table size (i.e., M =
10).
 Since we can break X into three parts in any order, let’s divide it evenly.
 Therefore, a = 452, b = 378, c = 912.
 Now, H(x) = (a + b + c) mod M i.e., H(452378912) =(452 + 378 + 912) mod
10 = 1742 mod 10 = 2.
 Hence, 452378912 is inserted into the table at address 2.
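The folding algorithm H(x) = (a + b + c) mod M from the two examples above can be sketched as follows (`fold_hash` is an illustrative name; the key is split into 3-digit pieces, as in the examples):

```python
def fold_hash(key: int, table_size: int, piece_len: int = 3) -> int:
    """Fold the key: split its digits into pieces, sum them, mod table size."""
    s = str(key)
    pieces = [int(s[i:i + piece_len]) for i in range(0, len(s), piece_len)]
    return sum(pieces) % table_size

print(fold_hash(123456789, 10))  # (123 + 456 + 789) mod 10 = 8
print(fold_hash(452378912, 10))  # (452 + 378 + 912) mod 10 = 2
```

Keys whose length is not a multiple of three simply get a shorter final piece, matching the note that the last piece may not be of equal size.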

PROBLEMS ENCOUNTERED IN HASHING


Collision
Let us consider the case when we have a single array with four records, each
with two fields, one for the key and one to hold data (we call this a single slot
bucket). Let the hashing function be a simple modulus operator i.e. array index
is computed by finding the remainder of dividing the key by 4.

Array Index: = key MOD 4


Then key values 9, 13, 17 will all hash to the same index. When two (or more)
keys hash to the same value, a collision is said to occur.
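The collision described above is easy to verify with a short snippet:

```python
# Keys 9, 13 and 17 all leave remainder 1 when divided by 4,
# so they all map to the same slot of a 4-slot table.
table_size = 4
keys = [9, 13, 17]
indices = [k % table_size for k in keys]
print(indices)  # [1, 1, 1] -> three keys competing for one slot
```

Only one record can live at index 1, so the other two keys need a collision resolution strategy, which is the subject of the rest of this section.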

1. Collisions in Hash Functions:

In hashing algorithms (like those used for checking data integrity or in cryptographic applications), a
collision occurs when two different inputs (data, files, or numbers) result in the same hash value. Since
hash functions reduce data to a fixed-size output (hash), there's a possibility that two different inputs will
map to the same output.

Example of Collision in Hash Functions:

Let’s say we have a hash function H(x) that generates a 3-digit hash. For simplicity:
 H("apple") = 123
 H("banana") = 123


In this case, both "apple" and "banana" map to the same hash value 123, which is a collision. This means
that the hash function isn't perfectly distinguishing between these two different inputs.

2. Collisions in Modular Arithmetic:


In modular arithmetic, a collision happens when two different numbers produce the same result after
applying the modulus operation.
For example, in x mod 5, there could be multiple numbers that give the same remainder.
This means that different numbers collide in the modulus space.

Example of Collision in Modular Arithmetic:


Let’s find x mod 5 for two different numbers:
 7 mod 5 = 2
 12 mod 5 = 2
Here, both 7 and 12 give the same remainder (2) when divided by 5, which is a collision in the modulus
space.

3. Collisions in Random Number Generation:


In methods like mid-square or folding (as we discussed earlier), collisions can occur when two different
inputs (or seeds) generate the same output. This is a problem in random number generation, especially
when the number of possible outputs is smaller than the number of potential inputs.

Example of Collision in Random Number Generation:


Let’s say we are generating random numbers using a folding method. If we use seeds 12345 and 54321, we
might get the same result (after folding) for both numbers, resulting in a collision.

Why are Collisions a Problem?


Collisions are undesirable in many cases because:
1. Loss of Information: If different inputs produce the same result, you lose information about the
differences between those inputs.
2. Inconsistency: In hash functions, collisions can make it harder to guarantee data integrity because
two different inputs can appear to be the same.
3. Security Risk: In cryptographic applications, collisions in hash functions can be exploited to forge
signatures or break data integrity.

How to Minimize Collisions:


1. Increase the Range: In modular arithmetic or hashing, increasing the size of the output space
reduces the likelihood of collisions (e.g., using a larger modulus or a larger hash function).
2. Better Hash Functions: Cryptographic hash functions are designed to minimize collisions (e.g.,
SHA-256). They aim for a property called "collision resistance," which ensures that it's
computationally infeasible to find two different inputs that produce the same output.
3. Salting: In some applications (like password hashing), a salt (random value) is added to inputs to
prevent collisions from occurring when two identical inputs are hashed.

Conclusion:
 A collision in modular arithmetic occurs when two different numbers give the same remainder
after applying the modulus.
 In hashing, a collision happens when two different inputs produce the same hash value.
 Collisions are generally undesirable, as they reduce the ability to distinguish between different
inputs and can cause security or data integrity problems.

Collision Resolution

1. Open Addressing:

TEACHERS GUIDE:
Open Addressing is a method used in hash tables to resolve collisions—which occur when
two different keys hash to the same index. Instead of storing multiple keys at the same
index (like in chaining, where a linked list is used to store multiple values), open
addressing resolves collisions by finding another open slot within the hash table.
How Open Addressing Works:

In open addressing, when a collision occurs (i.e., two keys hash to the same index), the
hash table searches for the next available slot using a probing technique. There are
different types of probing strategies, but the key idea is that you probe the table in a
systematic way to find an open slot.
Key Steps in Open Addressing:
1. Hash the key to find an initial index.
2. If the slot at that index is empty, the key is stored there and the insertion is successful.
3. If the slot is occupied by a different key (a collision), a probing strategy is used to find the
next available slot.
The probing techniques define how the next index is calculated when a collision occurs.
There are several common types of probing:

Types of Probing in Open Addressing:


1. Linear Probing:
Linear probing resolves collisions by checking the next index in the table sequentially
until an open slot is found.
o Formula for Linear Probing:
Next Index = (Current Index + i) mod Table Size
where i is the number of attempts made to find an open slot (starting with 0 and
increasing by 1 on each collision).
o Example: Let’s say we have a table of size 5 and we want to insert a key 20:
 We calculate the index of 20: 20 mod 5 = 0.
 If index 0 is already occupied, we check index 1, then index 2, and so on.
If we find an empty slot at index 2, we place the key 20 there.

2. Quadratic Probing:
Quadratic probing uses a quadratic function to calculate the next index. Instead of
linearly searching the next available slot, quadratic probing increases the distance
between probes in a quadratic manner.
o Formula for Quadratic Probing:
Next Index = (Current Index + i²) mod Table Size
where i is the number of attempts (starting with i = 1).
o Example: For a table of size 7 and a key 50, let’s say we hash 50 mod 7 = 1.
If the slot at index 1 is taken, the next index checked will be:
(Index + 1²) mod 7 = (1 + 1) mod 7 = 2
If index 2 is also taken, we check:
(Index + 2²) mod 7 = (1 + 4) mod 7 = 5
This probing pattern continues quadratically.
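A sketch of the quadratic probe sequence for the example above (the function name is illustrative):

```python
def quadratic_probe_sequence(key: int, table_size: int, attempts: int):
    """List the first `attempts` indices probed under quadratic probing."""
    start = key % table_size
    return [(start + i * i) % table_size for i in range(attempts)]

# Key 50 in a table of size 7: home slot 1, then offsets 1, 4, 9, ...
print(quadratic_probe_sequence(50, 7, 4))  # [1, 2, 5, 3]
```

The first three indices (1, 2, 5) match the worked example; the widening gaps between probes are what breaks up the long runs that linear probing produces.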

3. Double Hashing:
Double hashing uses a second hash function to calculate the step size for probing, making
it more complex but potentially more efficient in reducing clustering.
o Formula for Double Hashing:
Next Index = (Current Index + i · h₂(key)) mod Table Size
where h₂(key) is a second hash function, and i is the number of probes (attempts).
o Example: If the primary hash function gives an index of 3, and the second hash
function for the key gives h₂(key) = 2, we calculate the next index using the formula:
Next Index = (3 + i · 2) mod Table Size
The first probe will be at index 3 + 1 · 2 = 5, and the next probe will
be at index 3 + 2 · 2 = 7, and so on.
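A sketch of the double-hashing probe sequence; the table size of 10 is an assumption, since the example above leaves it unspecified, and the step value stands in for the output of a real second hash function.

```python
def double_hash_sequence(start: int, step: int, table_size: int, attempts: int):
    """Indices probed when the second hash function gives step size `step`."""
    return [(start + i * step) % table_size for i in range(attempts)]

# Primary hash index 3, second hash h2(key) = 2, table size 10 (assumed):
print(double_hash_sequence(3, 2, 10, 4))  # [3, 5, 7, 9]
```

Because the step size depends on the key, two keys that share a home slot usually follow different probe paths, which is why double hashing suffers less clustering than linear or quadratic probing.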

Linear Probing: Search for an empty slot sequentially.


In linear probing, the hash table is searched sequentially, starting from the
original location of the hash. If the location that we get is already
occupied, then we check the next location.
The function used for rehashing is as follows: rehash(key) = (n + 1) % table_size.
below:
Let hash(x) be the slot index computed using a hash function and S be the table
size
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
Let us consider a simple hash function as “key mod 7” and a sequence of keys
as 50, 700, 76, 85, 92, 73, 101,
which means hash(key)= key% S, here S=size of the table =7, indexed from 0 to
6. We can define the hash function as per our choice if we want to create a hash
table, although it is fixed internally with a pre-defined formula.

Applications of linear probing:


Linear probing is a collision handling technique used in hashing, where the
algorithm looks for the next available slot in the hash table to store the collided
key. Some of the applications of linear probing include:

Symbol tables: Linear probing is commonly used in symbol tables, which are
used in compilers and interpreters to store variables and their associated
values. Since symbol tables can grow dynamically, linear probing can be used to
handle collisions and ensure that variables are stored efficiently.


Caching: Linear probing can be used in caching systems to store frequently
accessed data in memory. When a cache miss occurs, the data can be loaded
into the cache using linear probing, and when a collision occurs, the next
available slot in the cache can be used to store the data.
Databases: Linear probing can be used in databases to store records and their
associated keys. When a collision occurs, linear probing can be used to find the
next available slot to store the record.
Compiler design: Linear probing can be used in compiler design to implement
symbol tables, error recovery mechanisms, and syntax analysis.
Spell checking: Linear probing can be used in spell-checking software to store
the dictionary of words and their associated frequency counts. When a collision
occurs, linear probing can be used to store the word in the next available slot.
Overall, linear probing is a simple and efficient method for handling collisions
in hash tables, and it can be used in a variety of applications that require
efficient storage and retrieval of data.

Challenges in Linear Probing:


Primary Clustering: One of the problems with linear probing is primary
clustering: many consecutive elements form groups, and it starts taking longer
to find a free slot or to search for an element.
TEACHERS GUIDE:
1. Primary Clustering is an issue that occurs in open addressing methods (specifically with
linear probing) when consecutive hash table slots become clustered together due to
collisions. This clustering can degrade the performance of the hash table, as it increases the
number of probes required to find an open slot when inserting or searching for elements.
How Primary Clustering Happens:
When using linear probing, if a collision occurs at a given index, the algorithm checks the
next slot, and so on, until an empty slot is found. As elements keep colliding, they tend to fill up
consecutive slots in the hash table. This group of consecutive filled slots is known as a cluster.
Example: Let us consider a simple hash function as “key mod 5” and a sequence of
keys that are to be inserted: 8, 13, 18, 23.
Step1: First draw the empty hash table which will have a possible range of hash
values from 0 to 4 according to the hash function provided.

Inserting Key 8:
 Hashing: 8 mod 5 = 3
 Insert key 8 at index 3.

Inserting Key 13:


 Hashing: 13 mod 5 = 3 (collision at index 3)
 Use linear probing to check the next index: 3 + 1 = 4.
 Insert key 13 at index 4.

Inserting Key 18:


 Hashing: 18 mod 5 = 3 (collision at index 3)
 Use linear probing: Check index 4, but it’s already occupied (collision).
 Check index 0 (wrap around); it’s empty. Insert key 18 at index 0.

Inserting Key 23:


 Hashing: 23 mod 5 = 3 (collision at index 3)
 Use linear probing: Check index 4 (occupied by 13), check index 0 (occupied by 18),
check index 1 (empty).
 Insert key 23 at index 1.

Resulting Table with Primary Clustering:


After inserting all the keys, we have the following table:
Index: 0 → 18, 1 → 23, 2 → (empty), 3 → 8, 4 → 13

Notice how the cluster at the beginning of the table (indices 0 to 4) is growing. Whenever
there’s a collision at index 3, the probe sequence will expand the cluster of filled slots.
The issue is that the longer this sequence continues, the more the cluster grows. As a result:
1. Inserting new keys becomes slower because the probe sequence will likely have to
traverse through many occupied slots.
2. Searching for an element can also become slower because a larger cluster of slots must
be checked.
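The growth of the cluster can be made concrete by counting probes per insertion (`probes_needed` is an illustrative helper): with all four keys hashing to index 3, each successive key needs one more probe than the last.

```python
def probes_needed(table, key, table_size):
    """Insert key with linear probing; return how many slots were inspected."""
    idx = key % table_size
    count = 1
    while table[idx] is not None:        # walk the growing cluster
        idx = (idx + 1) % table_size
        count += 1
    table[idx] = key
    return count

table = [None] * 5
for k in (8, 13, 18, 23):
    print(k, probes_needed(table, k, 5))  # probe counts grow: 1, 2, 3, 4
```

Each insertion lengthens the cluster, so the next colliding key pays for every slot already in it; this linear growth in probe count is exactly the performance degradation described above.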

Why Primary Clustering is a Problem:


1. Decreased Performance: As the cluster grows, the number of probes required for
insertions and lookups increases significantly, leading to performance degradation.
2. Wasted Space: The growing cluster reduces the number of open spaces available for
future insertions, which can lead to more collisions and further exacerbate the problem.
3. Increased Load Factor: With primary clustering, you may find that the table fills up
much faster than expected because the table is inefficiently being used.

How to Mitigate Primary Clustering:


To address primary clustering, several strategies can be used to reduce the likelihood of
collisions, especially with linear probing.

1. Quadratic Probing: Instead of checking consecutive slots, quadratic probing uses a
quadratic formula to jump to the next slot, reducing the chance of forming large
clusters.
o Formula: Next Index = (Current Index + i²) mod Table Size, where i is the number of
probes (starting from 1 and increasing).
This reduces the size of clusters, as the gap between probes increases, preventing long
sequences of filled consecutive slots.
2. Double Hashing: A more complex method is double hashing, where a second hash
function is used to calculate the probe step. This is less likely to form primary clusters
because it generates a more random pattern of probe indices.
o Formula: Next Index = (Current Index + i · h₂(key)) mod Table Size, where h₂(key) is a
second hash function.
3. Rehashing: When the table becomes too full (often when the load factor exceeds a
certain threshold), the table size can be increased (rehashing), and the elements are
rehashed into the new table. This can help break up existing clusters and distribute
elements more evenly.
4. Better Table Sizing: Ensuring that the table size is a prime number can help avoid
some clustering issues, especially with methods like quadratic probing and double
hashing. This is because prime numbers reduce the likelihood of patterns forming in the
probe sequences.

Conclusion:
Primary Clustering is a common problem in open addressing (especially with linear
probing) that occurs when collisions cause consecutive slots to be filled, forming a large
cluster. This leads to slower insertion and search operations as the cluster grows. To mitigate
primary clustering, techniques like quadratic probing, double hashing, and rehashing are
commonly used to reduce the likelihood of consecutive slots being filled and improve overall
hash table performance.

2. Primary Clustering refers to the situation where these clusters of occupied slots keep
expanding in the table. If a new element hashes to an index that falls within or near an
existing cluster, it will be forced into this growing cluster, further increasing the number of
probes needed.
In essence, primary clustering happens because linear probing is not random enough to
avoid these clusters. The more collisions occur, the larger the cluster becomes, and the
slower the search and insertion operations will be.

TEACHERS GUIDE:
Example of Primary Clustering:
Consider a hash table of size 5 (indices 0 to 4), and let’s use linear probing to resolve
collisions. We'll insert the keys 8, 13, 18, and 23 into the hash table, and demonstrate how
primary clustering can develop.
Table Size: 5
 Initial Table:
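The table illustrations in this example are figures in the original document; the same walkthrough can be reproduced with a short linear-probing sketch (a hypothetical helper, not code from the text). Note how each successive key needs one more probe as the cluster around index 3 grows:

```python
def linear_probe_insert(table, key):
    """Insert `key` with linear probing; return the number of slots probed."""
    size = len(table)
    idx = key % size
    probes = 1
    while table[idx] is not None:
        idx = (idx + 1) % size
        probes += 1
    table[idx] = key
    return probes

table = [None] * 5
for key in (8, 13, 18, 23):       # all hash to index 3, since key % 5 == 3
    print(key, linear_probe_insert(table, key))   # probes: 1, 2, 3, 4
print(table)                      # [18, 23, None, 8, 13]
```

Every insertion after the first lands in the same growing cluster, so the probe count increases by one each time.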

Secondary Clustering: Secondary clustering is less severe; two records only
have the same collision chain (probe sequence) if their initial position is the
same.

TEACHERS GUIDE:

Secondary Clustering is another issue that arises in open addressing when probing for

available slots in a hash table. Unlike primary clustering, which occurs specifically with linear
probing, secondary clustering is associated with quadratic probing. It refers to the
phenomenon where, despite using a non-linear probing strategy, keys that hash to the same
index still follow the same probe sequence, causing the formation of clusters. (Double
hashing largely avoids this, because its probe step depends on the key itself.)

What is Secondary Clustering?


Secondary clustering happens when:
 Two or more keys hash to the same index.
 Under quadratic probing, those keys still follow exactly the same probing sequence,
because the sequence depends only on the initial index, not on the key.
 This leads to a situation where keys that initially collide continue to follow the same
pattern of probes, causing them to compete for the same or nearby positions.

In essence, while primary clustering involves large runs of consecutive slots due to linear
sequential probing, secondary clustering involves keys that share an initial index following
an identical probe pattern produced by the probing function.

Difference Between Primary and Secondary Clustering:


 Primary Clustering happens with linear probing and is caused by consecutive occupied slots
being checked for collisions, forming large clusters of filled slots.
 Secondary Clustering happens with quadratic probing, and occurs when keys that hash to the
same index follow the same probe sequence (even though the probing method isn’t linear). This
leads to smaller but still undesirable clusters. Double hashing avoids it, since its probe step is
derived from the key itself.

How Secondary Clustering Works:


Let’s look at quadratic probing and how secondary clustering can form.
Example of Secondary Clustering with Quadratic Probing:
Consider a hash table with 5 slots and a quadratic probing strategy. Quadratic probing works by
adding i² (where i is the number of attempts or probes) to the index when a collision occurs.
 Formula for quadratic probing:
Next Index = (Current Index + i²) mod Table Size
Let’s insert the following keys: 20, 30, and 40, using a hash function where the hash value is
determined by Key mod Table Size.
Table Size: 5
 Initial Table:

Inserting Key 20:


 Hashing: 20 mod 5 = 0
 Insert key 20 at index 0.

Inserting Key 30:


 Hashing: 30 mod 5 = 0 (collision at index 0)
 Quadratic Probing: Try (0 + 1²) mod 5 = 1.
 Insert key 30 at index 1.


Inserting Key 40:


 Hashing: 40 mod 5 = 0 (collision at index 0)
 Quadratic Probing: Try (0 + 1²) mod 5 = 1, but index 1 is
occupied.
 Quadratic Probing: Try (0 + 2²) mod 5 = 4.
 Insert key 40 at index 4.

Now, let's analyze what's happening:


 Keys 20 and 30 collided at index 0 and follow the same probe sequence under quadratic
probing.
 When 40 is inserted, it also collides at index 0 and follows the same probe sequence,
which leads it to index 4.
In this case, even though we’re using quadratic probing, keys 20, 30, and 40 all traverse the
identical sequence of probe positions starting from index 0. Every key that hashes to index 0
competes for the same slots, so occupied slots accumulate along that shared sequence. This is
secondary clustering because the sequence of probes is determined entirely by the initial
collision at index 0, not by the key itself.
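The shared probe sequence can be verified with a few lines of Python (a sketch for this example only):

```python
def quadratic_probe_sequence(key, table_size, attempts):
    """Probe indices visited by quadratic probing: (h + i*i) mod size."""
    start = key % table_size
    return [(start + i * i) % table_size for i in range(attempts)]

# Keys 20, 30, 40 all hash to 0 in a table of size 5, so they share
# exactly the same probe sequence -- the essence of secondary clustering.
for key in (20, 30, 40):
    print(key, quadratic_probe_sequence(key, 5, 4))   # [0, 1, 4, 4] for each
```

Because the sequence depends only on the starting index, any key hashing to 0 must fight over the same handful of slots.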

Another Example: Let us consider a simple hash function as “key mod 5” and a
sequence of keys that are to be inserted are 50, 70, 76, 93.
Step1: First draw the empty hash table which will have a possible range of hash
values from 0 to 4 according to the hash function provided.

Hash table
Step 2: Now insert all the keys in the hash table one by one. The first key is 50.
It will map to slot number 0 because 50%5=0. So insert it into slot number 0.


Insert 50 into hash table


Step 3: The next key is 70. It will map to slot number 0 because 70%5=0, but 50
is already at slot number 0, so search for the next empty slot and insert it there.

Insert 70 into hash table


Step 4: The next key is 76. It will map to slot number 1 because 76%5=1, but 70
is already at slot number 1, so search for the next empty slot and insert it there.


Insert 76 into hash table


Step 5: The next key is 93. It will map to slot number 3 because 93%5=3, so
insert it into slot number 3.

Insert 93 into hash table
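The five steps above can be reproduced with a minimal linear-probing sketch (the table figures are images in the original document):

```python
def insert_linear(table, key):
    """Linear probing: start at key mod size, scan forward to an empty slot."""
    size = len(table)
    idx = key % size
    while table[idx] is not None:
        idx = (idx + 1) % size
    table[idx] = key
    return idx

table = [None] * 5
for key in (50, 70, 76, 93):
    print(f"{key} -> slot {insert_linear(table, key)}")
print(table)   # [50, 70, 76, 93, None]
```

The output matches the walkthrough: 50 lands at 0, 70 overflows to 1, 76 overflows to 2, and 93 goes straight to 3.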

2. Quadratic Probing

If you observe carefully, you will see that the interval between probes
increases with each attempt. Quadratic probing is a method with the help of
which we can solve the problem of clustering discussed above. In this
method, we look for the i²-th slot in the i-th iteration. We always start from the
original hash location. If that location is occupied, then we check the other
slots.
let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Example: Let us consider table size = 7, hash function Hash(x) = x % 7, and
collision resolution strategy f(i) = i². Insert 22, 30, and 50.
Step 1: Create a table of size 7.


Hash table
Step 2 – Insert 22 and 30
Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily insert
22 at slot 1.
Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily insert
30 at slot 2.

Insert keys 22 and 30 in the hash table


Step 3: Inserting 50
Hash(50) = 50 % 7 = 1


In our hash table slot 1 is already occupied. So, we will search for slot 1 + i²,
i.e. 1 + 1² = 2.

Again slot 2 is found occupied, so we will search for slot 1 + i², i.e.
1 + 2² = 1 + 4 = 5.
Now, cell 5 is not occupied so we will place 50 in slot 5.

Insert key 50 in the hash table
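The three steps above can be checked with a minimal quadratic-probing sketch (illustrative code, not from the text):

```python
def insert_quadratic(table, key):
    """Quadratic probing: try (h + i*i) mod size for i = 0, 1, 2, ..."""
    size = len(table)
    h = key % size
    for i in range(size):
        idx = (h + i * i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no empty slot found along this probe sequence")

table = [None] * 7
for key in (22, 30, 50):
    print(f"{key} -> slot {insert_quadratic(key=key, table=table)}")
# 22 -> slot 1, 30 -> slot 2, 50 -> slot 5, exactly as in the walkthrough
```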

Quadratic Probing: Search for an empty slot using a quadratic function


Quadratic probing is an open-addressing scheme where we look for the i²-th
slot in the i-th iteration if the given hash value x collides in the hash table.

How Quadratic Probing is done?


Let hash(x) be the slot index computed using the hash function.
If the slot hash(x) % S is full, then we try (hash(x) + 1*1) % S.
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S.
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S.
This process is repeated for all the values of i until an empty slot is found.

For example: Let us consider a simple hash function as “key mod 7” and
sequence of keys as 50, 700, 76, 85, 92, 73, 101
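The original document shows the resulting table as a figure; a short sketch of the scheme described above computes it for this key sequence:

```python
def insert_quadratic(table, key):
    """Try (hash(x) + i*i) mod S for i = 0, 1, 2, ... until an empty slot."""
    size = len(table)
    h = key % size
    for i in range(size):
        idx = (h + i * i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no empty slot found along this probe sequence")

table = [None] * 7
for key in (50, 700, 76, 85, 92, 73, 101):
    insert_quadratic(table, key)
print(table)   # [700, 50, 85, 73, 101, 92, 76]
```

Collisions occur at index 1 (for 85 and 92) and index 3 (for 101); quadratic probing resolves them by jumping to slots 2, 5, and 4 respectively.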


Closed Addressing:
Chaining: Store colliding keys in a linked list at each index.
The idea behind separate chaining is to implement each slot of the array as a
linked list called a chain. Separate chaining is one of the most popular and
commonly used techniques for handling collisions.

The linked list data structure is used to implement this technique. So what
happens is, when multiple elements are hashed into the same slot index, then
these elements are inserted into a singly-linked list which is known as a chain.

Here, all those elements that hash into the same slot index are inserted into a
linked list. Now, we can use a key K to search in the linked list by just linearly
traversing. If the intrinsic key for any entry is equal to K then it means that we
have found our entry. If we have reached the end of the linked list and yet we
haven’t found our entry then it means that the entry does not exist. Hence, the
conclusion is that in separate chaining, if two different elements have the same
hash value then we store both the elements in the same linked list one after the
other.

Example: Let us consider a simple hash function as “key mod 7” and a sequence
of keys as 50, 700, 76, 85, 92, 73, 101
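The original shows the resulting chains as a figure; a minimal separate-chaining sketch (using Python lists to stand in for the linked lists) reproduces them:

```python
from collections import defaultdict

def build_chains(keys, size):
    """Separate chaining: each slot holds a list (the chain) of colliding keys."""
    table = defaultdict(list)
    for key in keys:
        table[key % size].append(key)   # append to the chain at this slot
    return dict(table)

chains = build_chains([50, 700, 76, 85, 92, 73, 101], 7)
print(chains)
# Slot 1 chains 50 -> 85 -> 92, and slot 3 chains 73 -> 101.
```

A search for key K traverses the chain at K mod 7; for example, 92 is found by scanning the three-element chain at slot 1.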


Advantages:

Simple to implement.

The hash table never fills up; we can always add more elements to the chain.
Less sensitive to the hash function or load factor.
It is mostly used when it is unknown how many keys may be inserted or
deleted, and how frequently.

Disadvantages:

The cache performance of chaining is not good as keys are stored using a linked
list. Open addressing provides better cache performance as everything is
stored in the same table.

Wastage of Space (Some Parts of the hash table are never used)

If the chain becomes long, then search time can become O(n) in the worst case
Uses extra space for links

Performance of Chaining:

Performance of hashing can be evaluated under the assumption that each key is
equally likely to be hashed to any slot of the table (simple uniform hashing).

m = Number of slots in hash table

n = Number of keys to be inserted in hash table

Load factor α = n/m
Under simple uniform hashing, the expected length of a chain is α, so search,
insert, and delete each take expected time O(1 + α).
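A quick simulation illustrates the load factor: hashing n = 21 keys into m = 7 slots by key mod m gives α = 3, and every chain in this contrived uniform key set has exactly that expected length:

```python
def chain_lengths(keys, m):
    """Length of each chain when `keys` are hashed by key mod m."""
    lengths = [0] * m
    for key in keys:
        lengths[key % m] += 1
    return lengths

m, keys = 7, list(range(21))          # n = 21 consecutive keys into m = 7 slots
lengths = chain_lengths(keys, m)
alpha = len(keys) / m                 # load factor α = n / m
print(lengths, alpha)                 # every chain has length α = 3.0
```

With arbitrary key sets the chain lengths vary around α, but their average is always n/m.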


Cuckoo Hashing: Use multiple hash functions to distribute keys


Cuckoo hashing applies the idea of multiple-choice and relocation together and
guarantees O(1) worst case lookup time!

Multiple-choice: We give a key two choices the h1(key) and h2(key) for residing.

Relocation: It may happen that h1(key) and h2(key) are preoccupied. This is
resolved by imitating the Cuckoo bird: it pushes the other eggs or young out of
the nest when it hatches. Analogously, inserting a new key into a cuckoo hashing
table may push an older key to a different location. This leaves us with the
problem of re-placing the older key.

If the alternate position of older key is vacant, there is no problem.

Otherwise, the older key displaces another key. This continues until the
procedure finds a vacant position, or enters a cycle. In the case of a cycle, new
hash functions are chosen and the whole data structure is ‘rehashed’. Multiple
rehashes might be necessary before Cuckoo succeeds.

Insertion is expected O(1) (amortized) with high probability, even considering the
possibility of rehashing, as long as the number of keys is kept below half of the
capacity of the hash table, i.e., the load factor is below 50%.

Deletion is O(1) worst-case as it requires inspection of just two locations in the
hash table.

Illustration
Input:

{20, 50, 53, 75, 100, 67, 105, 3, 36, 39}

Hash Functions:

h1(key) = key%11

h2(key) = (key/11) %11

Let’s start by inserting 20 at its possible position in the first table determined by
h1(20):

Next: 50


Next: 53. h1(53) = 9. But 20 is already there at 9. We place 53 in table 1 & 20 in


table 2 at h2(20)

Next: 75. h1(75) = 9. But 53 is already there at 9. We place 75 in table 1 & 53 in


table 2 at h2(53)

Next: 100. h1(100) = 1.

Next: 67. h1(67) = 1. But 100 is already there at 1. We place 67 in table 1 & 100
in table 2 at h2(100) = 9.

Next: 105. h1(105) = 6. But 50 is already there at 6. We place 105 in table 1 &
50 in table 2 at h2(50) = 4. Now 53 has been displaced. h1(53) = 9. 75 displaced:
h2(75) = 6.

Next: 3. h1(3) = 3.


Next: 36. h1(36) = 3. But 3 is already there at 3. We place 36 in table 1 & 3 in
table 2 at h2(3) = 0.

Next: 39. h1(39) = 6. h2(105) = 9. h1(100) = 1. h2(67) = 6. h1(75) = 9. h2(53) = 4.


h1(50) = 6. h2(39) = 3.
Here, the new key 39 is itself displaced later in the chain of relocations it
triggered by displacing 105: the evictions eventually bring 50 back to index 6
of table 1, pushing 39 to table 2 at h2(39) = 3.
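The whole illustration can be replayed with a compact cuckoo-hashing sketch. The hash functions match the ones above; the MAX_KICKS cutoff and the give-up-instead-of-rehash behavior are simplifying assumptions:

```python
def h1(key):
    return key % 11

def h2(key):
    return (key // 11) % 11

class CuckooHashTable:
    """Two-table cuckoo hashing sketch for the illustration's hash functions."""
    MAX_KICKS = 50   # assumed cutoff before a real table would rehash

    def __init__(self, size=11):
        self.tables = [[None] * size, [None] * size]
        self.hashes = [h1, h2]

    def lookup(self, key):
        """O(1) worst case: a key can only live in one of its two slots."""
        return any(self.tables[t][self.hashes[t](key)] == key for t in (0, 1))

    def insert(self, key):
        t = 0
        for _ in range(self.MAX_KICKS):
            idx = self.hashes[t](key)
            # Place the key; whatever was there becomes the new key to home.
            key, self.tables[t][idx] = self.tables[t][idx], key
            if key is None:          # nothing was displaced; done
                return True
            t = 1 - t                # reinsert the evicted key in the other table
        return False                 # cycle suspected; a real table would rehash

table = CuckooHashTable()
keys = (20, 50, 53, 75, 100, 67, 105, 3, 36, 39)
for key in keys:
    table.insert(key)
print(all(table.lookup(key) for key in keys))   # True
```

Replaying the inserts reproduces the displacement chain described above: inserting 39 evicts 105, which cascades through 100, 67, 75, 53, and 50 until 39 itself is pushed to table 2 at index 3.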

