Hash Tables
Reference: Introduction to Algorithms by T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein
Hash Tables
• Motivation: symbol tables
– A compiler uses a symbol table to relate symbols
to associated data
• Symbols: variable names, procedure names, etc.
• Associated data: memory location, call graph, etc.
– For a symbol table (also called a dictionary), we
care about search, insertion, and deletion
– We typically don’t care about sorted order
Hash Tables
• More formally:
– Given a table T and a record x, with key (= symbol)
and satellite data, we need to support:
• Insert (T, x)
• Delete (T, x)
• Search(T, x)
– We want these to be fast, but don’t care about
sorting the records
• The structure we will use is a hash table
– Supports all the above in O(1) expected time!
Hashing: Keys
• In the following discussion we will consider
all keys to be (possibly large) natural numbers
• How can we convert floats to natural numbers
for hashing purposes?
• How can we convert ASCII strings to natural
numbers for hashing purposes?
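As a concrete sketch in C (illustrative, not from the reference): reinterpret a float's bit pattern, and read a string as a radix-128 number.

#include <stdint.h>
#include <string.h>

/* Floats: reinterpret the bit pattern as an unsigned integer. */
uint64_t float_key(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* safe w.r.t. aliasing rules */
    return bits;
}

/* ASCII strings: treat the characters as digits of a radix-128
 * number, reduced on the fly (the value wraps mod 2^64). */
uint64_t string_key(const char *s) {
    uint64_t k = 0;
    for (; *s != '\0'; s++)
        k = k * 128 + (uint64_t)(unsigned char)*s;
    return k;
}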
Direct Addressing
• Suppose:
– The range of keys is 0..m-1
– Keys are distinct
• The idea:
– Set up an array T[0..m-1] in which
• T[i] = x if x ∈ T and key[x] = i
• T[i] = NULL otherwise
– This is called a direct-address table
• Operations take O(1) time!
• So what’s the problem?
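Before answering, here is a minimal C sketch of a direct-address table (names are illustrative; satellite data omitted):

#include <stddef.h>

typedef struct Record { int key; /* + satellite data */ } Record;
typedef struct { Record **slot; int m; } DirectTable;  /* slot[0..m-1] */

void da_insert(DirectTable *T, Record *x) { T->slot[x->key] = x; }
void da_delete(DirectTable *T, Record *x) { T->slot[x->key] = NULL; }
Record *da_search(DirectTable *T, int k)  { return T->slot[k]; }

All three are single array accesses, hence O(1).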
The Problems
• Direct addressing works well when the range
m of keys is relatively small
• But what if the keys are 32-bit integers?
– Problem 1: direct-address table will have
2^32 entries, more than 4 billion
– Problem 2: even if memory is not an issue, the
time to initialize the elements to NULL may be
• Solution: map keys to a smaller range 0..m-1
• This mapping is called a hash function
Hash Functions
• Next problem: collision
[Figure: a hash function h maps keys k1..k5 from the universe U (actual keys K) into table slots 0..m-1; here h(k2) = h(k5), a collision]
Resolving Collisions
• How can we solve the problem of collisions?
• Solution 1: chaining
• Solution 2: open addressing
Open Addressing
• Basic idea:
– To insert: if slot is full, try another slot, …, until an
open slot is found (probing)
– To search, follow the same sequence of probes as
would be used when inserting the element
• If we reach an element with the correct key, return it
• If we reach a NULL pointer, the element is not in the table
• Good for fixed sets (adding but no deletion)
– Example: spell checking
• Table needn’t be much bigger than n
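A minimal C sketch of open addressing with linear probing (one possible probe sequence, chosen here for illustration; the table size and names are assumptions):

#include <string.h>

#define M 8                          /* table size (demo-sized) */
static const char *table[M];         /* NULL marks an empty slot */

static unsigned h(const char *s) {   /* any string hash works here */
    unsigned k = 0;
    while (*s) k = k * 31 + (unsigned char)*s++;
    return k % M;
}

int oa_insert(const char *key) {
    unsigned i = h(key);
    for (int tries = 0; tries < M; tries++) {
        if (table[i] == NULL) { table[i] = key; return 1; }
        i = (i + 1) % M;             /* slot taken: probe the next one */
    }
    return 0;                        /* table full */
}

const char *oa_search(const char *key) {
    unsigned i = h(key);
    for (int tries = 0; tries < M; tries++) {
        if (table[i] == NULL) return NULL;   /* empty slot: absent */
        if (strcmp(table[i], key) == 0) return table[i];
        i = (i + 1) % M;             /* same probe sequence as insert */
    }
    return NULL;
}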
Chaining
• Chaining puts elements that hash to the same
slot in a linked list:
[Figure: table T with one chain per slot; k1 and k4 share a chain, k5, k2, and k7 another, k8 and k6 a third, and k3 sits alone in its slot]
Chaining
• How do we insert an element?
[Figure: the same chained table as above]
Chaining
• How do we delete an element?
[Figure: the same chained table as above]
Chaining
• How do we search for an element with a given key?
[Figure: the same chained table as above]
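Putting the three operations together, a minimal C sketch of a chained table (illustrative names; keys are plain ints):

#include <stdlib.h>

typedef struct Node { int key; struct Node *next; } Node;

#define M 101                        /* number of slots (a prime) */
static Node *T[M];                   /* T[i] heads the chain for slot i */

static int h(int k) { return ((k % M) + M) % M; }

void chain_insert(int key) {         /* O(1): insert at head of chain */
    Node *x = malloc(sizeof *x);
    x->key = key;
    x->next = T[h(key)];
    T[h(key)] = x;
}

Node *chain_search(int key) {        /* O(1 + length of the chain) */
    for (Node *p = T[h(key)]; p != NULL; p = p->next)
        if (p->key == key) return p;
    return NULL;
}

void chain_delete(int key) {         /* unlink the node from its chain */
    Node **pp = &T[h(key)];
    while (*pp != NULL && (*pp)->key != key) pp = &(*pp)->next;
    if (*pp != NULL) { Node *dead = *pp; *pp = dead->next; free(dead); }
}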
Analysis of Chaining
• Assume simple uniform hashing: each key in the
table is equally likely to be hashed to any slot
• Given n keys and m slots in the table, the
load factor α = n/m = average # keys per slot
• What will be the average cost of an
unsuccessful search for a key? A: O(1 + α)
• What will be the average cost of a successful
search? A: O(1 + α/2) = O(1 + α)
Analysis of Chaining
• So the cost of searching = O(1 + α)
• If the number of keys n is proportional to the
number of slots in the table, what is α?
• A: α = O(1)
– In other words, we can make the expected cost of
searching constant if we make α constant
Choosing A Hash Function
• Clearly choosing the hash function well is
crucial
– What will a worst-case hash function do?
– What will be the time to search in this case?
• What are desirable features of the hash
function?
– Should distribute keys uniformly into slots
– Should not depend on patterns in the data
Hash Functions:
The Division Method
• h(k) = k mod m
– In words: hash k into a table with m slots using the
slot given by the remainder of k divided by m
• What happens to elements with adjacent
values of k?
• What happens if m is a power of 2 (say 2^p)?
• What if m is a power of 10?
• Upshot: pick table size m = prime number not
too close to a power of 2 (or 10)
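In code the division method is a one-liner; choosing a good m is the whole trick:

unsigned hash_div(unsigned k, unsigned m) {
    return k % m;    /* e.g. m = 701, a prime not near a power of 2 */
}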
Hash Functions:
The Multiplication Method
• For a constant A, 0 < A < 1:
• h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA − ⌊kA⌋)⌋
– (kA − ⌊kA⌋) is the fractional part of kA
• Choose m = 2^p
• Choose A not too close to 0 or 1
• Knuth: a good choice is A = (√5 − 1)/2
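A C sketch following the formula directly (floating-point for clarity; fixed-point variants are common in practice, and p < 32 is assumed):

#include <math.h>

unsigned hash_mul(unsigned k, unsigned p) {        /* m = 2^p slots */
    const double A = (sqrt(5.0) - 1.0) / 2.0;      /* Knuth's ~0.618 */
    double frac = fmod((double)k * A, 1.0);        /* kA mod 1 */
    return (unsigned)((double)(1u << p) * frac);   /* floor(m * frac) */
}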
Data Structures
HEAPS
Heaps
• A heap can be seen as a complete binary tree:
[Figure: binary tree, level order: 16 | 14 10 | 8 7 9 3 | 2 4 1]
– What makes a binary tree complete?
– Is the example above complete?
Heaps
• A heap can be seen as a complete binary tree:
[Figure: the same tree, with the unfilled slots of the bottom level shown as empty boxes]
– The book calls them “nearly complete” binary
trees; can think of unfilled slots as null pointers
Heaps
• In practice, heaps are usually implemented as
arrays:
[Figure: the same tree stored as an array]
A = [16 14 10 8 7 9 3 2 4 1]
Heaps
• To represent a complete binary tree as an
array:
– The root node is A[1]
– Node i is A[i]
– The parent of node i is A[i/2] (note: integer divide)
– The left child of node i is A[2i]
– The right child of node i is A[2i + 1]
[Figure: the same tree and array A = [16 14 10 8 7 9 3 2 4 1]]
Referencing Heap Elements
• So…
Parent(i) { return i/2; }
Left(i) { return 2*i; }
Right(i) { return 2*i + 1; }
• An aside: How would you implement this
most efficiently?
• Another aside: Really?
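One likely intended answer (an assumption, since the slide leaves it open): with 1-based indexing these are single shift/or instructions, though the "Really?" hints that modern compilers emit exactly this for the /2 and *2 versions anyway:

static inline int Parent(int i) { return i >> 1; }        /* i / 2   */
static inline int Left(int i)   { return i << 1; }        /* 2 * i   */
static inline int Right(int i)  { return (i << 1) | 1; }  /* 2i + 1  */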
The Heap Property
• Heaps also satisfy the heap property:
A[Parent(i)] ≥ A[i] for all nodes i > 1
– In other words, the value of a node is at most the
value of its parent
– Where is the largest element in a heap stored?
• Definitions:
– The height of a node in the tree = the number of
edges on the longest downward path to a leaf
– The height of a tree = the height of its root
Heap Height
• What is the height of an n-element heap?
Why?
• This is nice: basic heap operations take at
most time proportional to the height of the
heap
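The standard derivation, filling in the "Why?": a heap of height h is complete down to depth h − 1, so

2^h ≤ n ≤ 2^(h+1) − 1  ⟹  h = ⌊lg n⌋ = Θ(lg n)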
Heap Operations: Heapify()
• Heapify(): maintain the heap property
– Given: a node i in the heap with children l and r
– Given: two subtrees rooted at l and r, assumed to
be heaps
– Problem: The subtree rooted at i may violate the
heap property (How?)
– Action: let the value of the parent node “float
down” so subtree at i satisfies the heap property
• What do you suppose will be the basic operation
between i, l, and r?
Heap Operations: Heapify()
Heapify(A, i)
{
l = Left(i); r = Right(i);
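// find the index of the largest of A[i], A[l], A[r]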
if (l <= heap_size(A) && A[l] > A[i])
largest = l;
else
largest = i;
if (r <= heap_size(A) && A[r] > A[largest])
largest = r;
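// if a child was larger, swap it up and fix the affected subtree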
if (largest != i)
Swap(A, i, largest);
Heapify(A, largest);
}
Heapify() Example
[Figure: 16 | 4 10 | 14 7 9 3 | 2 8 1; Heapify(A, 2): node 2 (value 4) violates the heap property]
A = [16 4 10 14 7 9 3 2 8 1]
Heapify() Example
[Figure: 16 | 14 10 | 4 7 9 3 | 2 8 1; 4 swapped with its larger child, 14]
A = [16 14 10 4 7 9 3 2 8 1]
Heapify() Example
[Figure: 16 | 14 10 | 8 7 9 3 | 2 4 1; 4 swapped with 8, and the heap property is restored]
A = [16 14 10 8 7 9 3 2 4 1]
Analyzing Heapify(): Informal
• Aside from the recursive call, what is the
running time of Heapify()?
• How many times can Heapify() recursively
call itself?
• What is the worst-case running time of
Heapify() on a heap of size n?
Analyzing Heapify(): Formal
• Fixing up relationships between i, l, and r
takes Θ(1) time
• If the heap at i has n elements, how many
elements can the subtrees at l or r have?
– Draw it
• Answer: 2n/3 (worst case: bottom row 1/2
full)
• So the time taken by Heapify() is given by
T(n) ≤ T(2n/3) + Θ(1)
The 2/3 Bound
• In a full binary tree with i internal nodes and r leaves,
counting edges gives 2i = r + i − 1, so r = i + 1
• Worst case: the bottom level of the heap is exactly half full,
so the left subtree ends in a full bottom row and the right
subtree has none
[Figure: the upper part of each subtree has m nodes; the left
subtree also has its m + 1 bottom-row leaves]
• Fraction of nodes in the left subtree = (2m + 1)/(3m + 2) ≤ 2/3
Analyzing Heapify(): Formal
• So we have
T(n) ≤ T(2n/3) + Θ(1)
• By case 2 of the Master Theorem,
T(n) = O(lg n)
• Thus, Heapify() takes logarithmic time
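To see why case 2 applies (a standard check, filled in here): a = 1 and b = 3/2, so n^(log_b a) = n^0 = 1, and f(n) = Θ(1) matches it; case 2 then gives

T(n) = Θ(n^(log_{3/2} 1) · lg n) = Θ(lg n)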
Heap Operations: BuildHeap()
• We can build a heap in a bottom-up manner
by running Heapify() on successive
subarrays
– Fact: for an array of length n, all elements in range
A[⌊n/2⌋ + 1 .. n] are (one-element) heaps (Why?)
– So:
• Walk backwards through the array from n/2 to 1, calling
Heapify() on each node.
• Order of processing guarantees that the children of
node i are heaps when i is processed
BuildHeap()
// given an unsorted array A, make A a heap
BuildHeap(A)
{
heap_size(A) = length(A);
for (i = length(A)/2 downto 1)
Heapify(A, i);
}
BuildHeap() Example
• Work through example
A = {4, 1, 3, 2, 16, 9, 10, 14, 8, 7}
[Figure: binary tree, level order: 4 | 1 3 | 2 16 9 10 | 14 8 7]
Analyzing BuildHeap()
• Each call to Heapify() takes O(lg n) time
• There are O(n) such calls (specifically, n/2)
• Thus the running time is O(n lg n)
– Is this a correct asymptotic upper bound?
– Is this an asymptotically tight bound?
• A tighter bound is O(n)
– How can this be? Is there a flaw in the above
reasoning?
Analyzing BuildHeap(): Tight
• To Heapify() a subtree takes O(h) time
where h is the height of the subtree
– h = O(lg m), m = # nodes in subtree
– The height of most subtrees is small
• Fact: an n-element heap has at most ⌈n/2^(h+1)⌉
nodes of height h
• Can be used to prove that BuildHeap()
takes O(n) time
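The proof is a short sum (a standard step, filled in here): summing the Heapify() costs over all heights, and using Σ_{h≥0} h/2^h = 2,

Σ_{h=0}^{⌊lg n⌋} ⌈n/2^(h+1)⌉ · O(h) = O(n · Σ_{h≥0} h/2^h) = O(n)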
Heapsort
• Given BuildHeap(), an in-place sorting
algorithm is easily constructed:
– Maximum element is at A[1]
– Discard by swapping with element at A[n]
• Decrement heap_size[A]
• A[n] now contains correct value
– Restore heap property at A[1] by calling
Heapify()
– Repeat, always swapping A[1] for A[heap_size(A)]
Heapsort
Heapsort(A)
{
BuildHeap(A);
for (i = length(A) downto 2)
{
Swap(A[1], A[i]);
heap_size(A) -= 1;
Heapify(A, 1);
}
}
Analyzing Heapsort
• The call to BuildHeap() takes O(n) time
• Each of the n - 1 calls to Heapify() takes
O(lg n) time
• Thus the total time taken by HeapSort()
= O(n) + (n - 1) O(lg n)
= O(n) + O(n lg n)
= O(n lg n)
Problems
• 1. Design an algorithm to merge k large sorted
arrays of a total of n elements using no more
than O(k) additional storage. Here k << n.
• 2. Design an algorithm which reads in n
integers one by one and then returns the k
smallest ones. Here k << n and n is unknown
to start with.
Priority Queues
• Heapsort is a nice algorithm, but in practice
Quicksort usually wins
• But the heap data structure is incredibly
useful for implementing priority queues
– A data structure for maintaining a set S of
elements, each with an associated value or key
– Supports the operations Insert(),
Maximum(), and ExtractMax()
– What might a priority queue be useful for?
Priority Queue Operations
• Insert(S, x) inserts the element x into set S
• Maximum(S) returns the element of S with
the maximum key
• ExtractMax(S) removes and returns the
element of S with the maximum key
• How could we implement these operations
using a heap?
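One way, sketched in C under the slides' 1-based array convention (heapify() here is the earlier Heapify(), with the heap size passed explicitly):

void heapify(int A[], int n, int i) {      /* slides' Heapify() */
    int l = 2*i, r = 2*i + 1, largest = i;
    if (l <= n && A[l] > A[largest]) largest = l;
    if (r <= n && A[r] > A[largest]) largest = r;
    if (largest != i) {
        int t = A[i]; A[i] = A[largest]; A[largest] = t;
        heapify(A, n, largest);
    }
}

int maximum(int A[]) { return A[1]; }      /* O(1): max is at the root */

int extract_max(int A[], int *n) {         /* O(lg n) */
    int max = A[1];
    A[1] = A[(*n)--];                      /* last element to the root */
    heapify(A, *n, 1);                     /* float it back down */
    return max;
}

void insert(int A[], int *n, int key) {    /* O(lg n) */
    int i = ++(*n);
    A[i] = key;                            /* add at the end... */
    while (i > 1 && A[i/2] < A[i]) {       /* ...and bubble it up */
        int t = A[i/2]; A[i/2] = A[i]; A[i] = t;
        i /= 2;
    }
}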
Example of Insertion to Max Heap
• Add at the end, move upwards
[Figure, three stages:
initial heap: 20 | 15 2 | 14 10
insert 5: placed in the next free slot, then swapped up past 2: 20 | 15 5 | 14 10 2
insert 21: bubbles up past 5 and 20 to the root: 21 | 15 20 | 14 10 2 5]
Example of Deletion from Max Heap
• EXTRACT_MAX
[Figure, three stages:
initial heap: 20 | 15 2 | 14 10
remove the root 20 and move the last element 10 up: 10 | 15 2 | 14
Heapify floats 10 down past 15 and 14: 15 | 14 2 | 10]