Maps
1
Maps
A map models a searchable collection of key-value
entries
The main operations of a map are for searching,
inserting, and deleting items
Multiple entries with the same key are not allowed
Applications:
address book (yellowpage)
student-record database
2
Entry ADT
An entry stores a key-value pair (k,v)
Methods:
key(): return the associated key
value(): return the associated value
setKey(k): set the key to k
setValue(v): set the value to v
We call this “item” or “element” or “record” exchangeably.
Then, MAP stores multiple a collection of Entries
3
The Map ADT
find(k): if the map M has an entry with key k, return and iterator
to it; else, return special iterator end
put(k, v): if there is no entry with key k, insert entry
(k, v), and otherwise set its value to v. Return an iterator to the
new/modified entry
erase(k): if the map M has an entry with key k, remove it from M
size(), empty()
begin(), end(): return iterators to beginning and end of M
Important Issue?
Using what data structure and algorithm, we implement Map?
4
Example
Operation Output Map
empty() true Ø
put(5,A) [(5,A)] (5,A)
put(7,B) [(7,B)] (5,A),(7,B)
put(2,C) [(2,C)] (5,A),(7,B),(2,C)
put(8,D) [(8,D)] (5,A),(7,B),(2,C),(8,D)
put(2,E) [(2,E)] (5,A),(7,B),(2,E),(8,D)
find(7) [(7,B)] (5,A),(7,B),(2,E),(8,D)
find(4) end (5,A),(7,B),(2,E),(8,D)
find(2) [(2,E)] (5,A),(7,B),(2,E),(8,D)
size() 4 (5,A),(7,B),(2,E),(8,D)
erase(5) — (7,B),(2,E),(8,D)
erase(2) — (7,B),(8,D)
find(2) end (7,B),(8,D)
empty() false (7,B),(8,D)
5
map in C++
6
A Simple List-Based Map
An easiest way of implementing Map
We can efficiently implement a map using an unsorted list
We store the items of the map in a list S (based on a doubly-
linked list), in arbitrary order
header nodes/positions trailer
9 c 6 c 5 c 8 c
entries
7
The find Algorithm
Algorithm find(k):
for each p in [S.begin(), S.end()) do
if pkey() = k then
return p
return S.end() {there is no entry with key equal to k}
We use pkey() as a
shortcut for (*p).key()
8
The put Algorithm
Algorithm put(k,v):
for each p in [S.begin(), S.end()) do
if pkey() = k then
psetValue(v)
return p
p = S.insertBack((k,v)) {there is no entry with key k}
n=n+1 {increment number of entries}
return p
9
The erase Algorithm
Algorithm erase(k):
for each p in [S.begin(), S.end()) do
if p.key() = k then
S.erase(p)
n=n–1 {decrement number of
entries}
10
Performance of a List-Based Map
Performance:
put takes O(n) time since we need to determine whether it is already in
the sequence
find and erase take O(n) time since in the worst case (the item is not
found) we traverse the entire sequence to look for an item with the given
key
The unsorted list implementation is effective only for maps of
small size or for maps in which puts are the most common
operations, while searches and removals are rarely performed
(e.g., historical record of logins to a workstation)
Can we improve?
11
Hash Tables
0
1 025-612-0001
2 981-101-0002
3
4 451-229-0004
12
Recall the Map ADT
find(k): if the map M has an entry with key k, return its
associated value; else, return null
put(k, v): insert entry (k, v) into the map M; if key k is not
already in M, then return null; else, return old value
associated with k
erase(k): if the map M has an entry with key k, remove it
from M and return its associated value; else, return null
size(), empty()
entrySet(): return a list of the entries in M
keySet(): return a list of the keys in M
values(): return a list of the values in M
13
What about this idea?
We design a table for a map
storing entries as (SSN, Name),
where SSN (social security 0
number) is a nine-digit positive 1 025-612-0001
integer 2 981-101-0002
3
4 451-229-0004
Our table uses an array of size
…
N = 10,000 and classify each
person based on the last four 9997
digits of his/her SSN 9998 200-751-9998
9999
Do you think that we can
speed up searching using this
method?
14
Hash Table: Overview
Use key as an “address” for a value
Worst-case performance: still O(n)
But, usually expected performance: O(1)
Practically very fast
Consists of two major components
1. Bucket array
2. Hash function
15
Bucket Array
An array of size N, where each cell Bucket
of A is “bucket”
0
1 025-612-0001
Two Issues 2 981-101-0002
How to choose the bucket size N? 3
Large N? 4 451-229-0004
…
Small N?
Keys should be integers, but in 9997
practice, not always. 9998 200-751-9998
Key: string “yiyung” 9999
16
Hash Functions and Hash Tables
A hash function h maps keys of a given type to integers in a fixed
interval [0, N - 1]
Example:
h(x) = x mod N
is a hash function for integer keys
The integer h(x) is called the hash value of key x
A hash table for a given key type consists of
Hash function h
Array (called table or bucket array) of size N
When implementing a map with a hash table, the goal is to store
item (k, o) at index i = h(k)
17
Hash Functions
A hash function is usually specified as The hash code is applied first, and the
the composition of two functions: compression function is applied next on
Hash code: the result, i.e.,
h1: keys integers h(x) = h2(h1(x))
Compression function:
h2: integers [0, N - 1] The goal of the hash function is to
“disperse” the keys in an apparently
random way
(Note) Keys can be arbitrary objects, e.g.,
string “yiyung”
18
Hash Code and Compress Function
There are extensive theoretical and experiment research
about “good” hash code and compress functions
In the next 3 slides,
We will discuss some basic hash codes and compress
functions.
Looking at their more details is not the beyond of our scope.
19
(1) Hash Codes
Memory address:
We reinterpret the memory Component sum:
address of the key object as We partition the bits of the key
an integer into components of fixed length
(e.g., 16 or 32 bits) and we sum
the components (ignoring
Integer cast: overflows)
We reinterpret the bits of the
key as an integer Suitable for numeric keys of
Suitable for keys of length less fixed length greater than or
equal to the number of bits of
than or equal to the number
the integer type (e.g., long and
of bits of the integer type
double in C++)
(e.g., byte, short, int and float But, not good for strings
in C++) “temp01” and “temp10”
“stop”, “tops”, “pots”, “spot”
20
Polynomial Hash Code
Polynomial accumulation: Polynomial p(z) can be
We partition the bits of the evaluated in O(n) time using
key into a sequence of
components of fixed length Horner’s rule:
(e.g., 8, 16 or 32 bits) The following polynomials are
a0 a1 … an-1 successively computed, each
We evaluate the polynomial from the previous one in O(1)
p(z) = a0 + a1 z + a2 z2 + … time
… + an-1zn-1 p0(z) = an-1
at a fixed value z, ignoring pi (z) = an-i-1 + zpi-1(z)
overflows (i = 1, 2, …, n -1)
Especially suitable for strings
We have p(z) = pn-1(z)
(e.g., the choice z = 33 gives
at most 6 collisions on a set of
50,000 English words) Lots of research about “good
hash code”
21
(2) Compression Functions
Division: Multiply, Add and Divide
h2 (y) = |y| mod N (MAD):
The size N of the hash h2 (y) = |ay + b| mod N
table is usually chosen to a and b are nonnegative
be a prime integers such that
a mod N 0
Otherwise, every integer
The reason has to do
would map to the same
with number theory and value b
is beyond the scope of
this course
22
Collision Handling
0 Insert the entry: 032-637-0004
1 025-612-0001
2 981-101-0002
3
4 451-229-0004
…
9997
9998 200-751-9998
9999
Collisions occur when different
elements are mapped to the
same cell
Collision Handling
Separate chaining is simple,
Separate Chaining: let but requires additional
each cell in the table memory outside the table
point to a linked list of
entries that map there
h(y) = y mod 13
24
Open Addressing: Linear Probing
Open addressing: the colliding item is placed in a different cell of the table
Linear probing: handles collisions by placing the colliding item in the next
(circularly) available table cell
Each table cell inspected is referred to as a “probe”
Colliding items lump together, causing future collisions to cause a longer
sequence of probes
h(x) = x mod 11
25
Linear Probing: Example
Example:
h(x) = x mod 13
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
0 1 2 3 4 5 6 7 8 9 10 11 12
41 18 44 59 32 22 31 73
0 1 2 3 4 5 6 7 8 9 10 11 12
26
Search with Linear Probing
Consider a hash table A that
uses linear probing Algorithm find(k)
find(k) i h(k)
p0
We start at cell h(k)
repeat
We probe consecutive
c A[i]
locations until one of the
following occurs if c =
An item with key k is found, return
or
null
An empty cell is found, or else if c.key () =
k
N cells have been
unsuccessfully probed return
c.value()
else
i (i +
1) mod N
pp+1 27
Updates with Linear Probing
To handle insertions and put(k, o)
deletions, we introduce a special We throw an exception if the
marker, called AVAILABLE, which table is full
replaces deleted elements We start at cell h(k)
Avoids a lot of shift operations
We probe consecutive cells until
one of the following occurs
erase(k) A cell i is found that is either
We search for an entry with key empty or stores AVAILABLE, or
k N cells have been
If such an entry (k, o) is found, unsuccessfully probed
we replace it with the special
item AVAILABLE and we return We store (k, o) in cell i
element o
Else, we return null
28
Other Issues
Search with Linear Probing
Clustering problem
Other open addressing method
Quadratic Probing, Double Hashing (the details in the book)
The load factor a = n/N affects the performance of a
hash table
Keeping the load factor below a certain threshold is vital
Open addressing (requires a < 0.5)
Separate-chaining (requires a < 0.9)
Resize the hash table, i.e., rehashing a new table
29
Performance of Hashing
In the worst case, searches,
insertions and removals on a hash
table take O(n) time
But, when well designed, the
expected running time of all the
The worst case occurs when all MAP ADT operations in a hash table
the keys inserted into the map is O(1)
collide
In practice, hashing is very fast
The load factor a = n/N provided the load factor is not close
to 100%
Assuming that the hash values
are like random numbers, it can
be shown that the expected
number of probes for an insertion
with open addressing is
1 / (1 - a)
30
Questions?