Greedy Algorithms
Lecture 22
29th November, 2013
Huffman Codes
Suppose we have a 100,000-character data file that we wish to store
compactly
The frequencies of the characters are given in the table below
A fixed-length code requires 3 bits per character, so 300,000 bits are
required to encode the entire file
Character                  a    b    c    d    e    f
Frequency (in thousands)   45   13   12   16   9    5
Fixed-length code          000  001  010  011  100  101
A widely used and very efficient technique for compressing data
Savings of 20% to 90% are typical, depending on the characteristics
of the data being compressed
Huffman coding is an entropy encoding algorithm used for
lossless data compression
We consider data to be a sequence of characters
The algorithm uses a greedy strategy
Consider a variable-length code
If we give frequent characters short codewords and infrequent characters
long codewords, we can do better
For example, using the variable-length code in the table below, the number
of bits required is (45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4) · 1000 = 224,000
bits (a quick check follows the table)
Character                  a    b    c    d    e    f
Frequency (in thousands)   45   13   12   16   9    5
Fixed-length code          000  001  010  011  100  101
Variable-length code       0    101  100  111  1101 1100
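As a sanity check of this arithmetic, a short Python snippet (my own, not from the slides) computes both totals directly from the table:

freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
var   = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

# Total bits = sum over all characters of frequency x codeword length
print(sum(freq[c] * len(fixed[c]) for c in freq) * 1000)   # 300000
print(sum(freq[c] * len(var[c]) for c in freq) * 1000)     # 224000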
A code in which no codeword is a prefix of any other codeword is called
a prefix code
Encoding is simple: a 3-character file abc is coded as 0·101·100, which
is 0101100
Prefix codes simplify decoding
The codeword that begins an encoded file is unambiguous: we can
simply identify the initial codeword, translate it back to the original
character, and repeat the decoding process on the remainder of the
encoded file
e.g., 001011101 parses uniquely as 0·0·101·1101, which decodes to aabe
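This scan-and-emit loop is easy to write down; the sketch below (my own transcription, using the variable-length code from the table above) decodes a bit string one codeword at a time:

# Decode a prefix code by identifying the initial codeword, emitting its
# character, and repeating on the remainder of the bit string.
CODE = {'0': 'a', '101': 'b', '100': 'c', '111': 'd', '1101': 'e', '1100': 'f'}

def decode(bits):
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in CODE:        # prefix property: the first match is unambiguous
            out.append(CODE[buf])
            buf = ''
    return ''.join(out)

print(decode('001011101'))     # -> 'aabe'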
Decoding requires an appropriate representation of the prefix
code
A binary tree whose leaves are the given characters provides one
such representation
[Figure: two binary code trees over the characters a-f, one for the
fixed-length code and one for the variable-length (Huffman) code; each leaf
is labeled with its character and frequency, each internal node with the sum
of its subtree's frequencies (e.g., 100 at the root), and each edge with
0 (left) or 1 (right)]
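Given such a tree, decoding is a walk from the root: follow the left child on 0 and the right child on 1, and emit a character each time a leaf is reached. A minimal sketch (the nested-tuple tree encoding is my own choice, built to match the variable-length code above):

# Tree for the variable-length code: left = 0, right = 1; leaves are characters.
TREE = ('a', (('c', 'b'), (('f', 'e'), 'd')))

def tree_decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[int(bit)]        # 0 -> left child, 1 -> right child
        if isinstance(node, str):    # reached a leaf: emit and restart at root
            out.append(node)
            node = root
    return ''.join(out)

print(tree_decode('001011101', TREE))   # -> 'aabe'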
Huffman Pseudocode
Huffman invented a greedy algorithm that constructs an optimal prefix
code called a Huffman code
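A minimal Python sketch of the standard construction (function names and tree representation are my own, not from the slides): repeatedly merge the two least-frequent subtrees, using a binary heap as the priority queue.

import heapq

def huffman(freq):
    # Heap entries are (frequency, tie-breaker, subtree); the counter avoids
    # comparing subtrees when two frequencies are equal.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    return heap[0][2]                     # root of the code tree

def codewords(tree, prefix=''):
    # Walk the tree: a left edge contributes '0', a right edge contributes '1'.
    if isinstance(tree, str):
        return {tree: prefix or '0'}
    left, right = tree
    return {**codewords(left, prefix + '0'), **codewords(right, prefix + '1')}

print(codewords(huffman({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})))

The exact codewords depend on how ties are broken and on which subtree becomes the left child, but the codeword lengths, and hence the 224,000-bit total, come out optimal either way.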
Analysis
The Huffman code tree is generated using a priority queue (a heap data
structure).
Initially, all n characters are stored in the heap. A data element can be
extracted from, or inserted into, a heap in O(lg n) time.
Each iteration extracts the two lowest-frequency nodes from the queue and
adds one merged node (with the combined frequency) back to the queue.
This operation is repeated n-1 times, so altogether about 3n
enqueue/dequeue operations are performed.
The total time for the extractions and insertions is thus about 3n · lg n.
Therefore, the running time for the creation of the code tree is O(n lg n)
String Matching
Finding all occurrences of a pattern in a text
More formally, assume the text is an array T[1..n] of length n, and the
pattern is an array P[1..m] of length m <= n. We further assume that the
elements of P and T are characters drawn from a finite alphabet Σ. For
example, we may have Σ = {0, 1} or Σ = {a, b, ..., z}
We say that P occurs with shift s in text T if 0 <= s <= n-m and
T[s+1..s+m] = P[1..m]. If P occurs with shift s in T, then we call s a
valid shift; otherwise we call s an invalid shift (a small check appears below)
The string-matching problem is the problem of finding all valid shifts
with which a given pattern P occurs in a given text T
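As a concrete check of the definition (the strings are my own example; note that Python slices are 0-indexed, so T[s+1..s+m] becomes T[s:s+m]):

T = 'abcabaabcabac'
P = 'abaa'
s = 3
print(T[s:s+len(P)] == P)   # True: s = 3 is a valid shift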
Naïve String Matching Algorithm
The brute-force algorithm makes a character-by-character
comparison between the pattern and the given text block
The pattern is moved along the text to examine all possible
alignments
The figure shows the initial and terminal placements of the
pattern
The naïve algorithm finds all valid shifts using a loop that checks
the condition P[1..m] = T[s+1..s+m] for each of the n-m+1
possible values of s
NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
      do if P[1..m] = T[s+1..s+m]
             then print "pattern occurs with shift" s
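The same algorithm in runnable form: a direct Python transcription of the pseudocode above (0-indexed, so the shifts run over range(n - m + 1)):

def naive_string_matcher(T, P):
    n, m = len(T), len(P)
    for s in range(n - m + 1):       # all n-m+1 candidate shifts
        if T[s:s+m] == P:            # character-by-character comparison
            print('pattern occurs with shift', s)

naive_string_matcher('abcabaabcabac', 'abaa')   # pattern occurs with shift 3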
Analysis
The code consists of two nested loops:
for i = 1 to n-m+1 do
    for j = 1 to m do
The outer loop iterates n-m+1 times and the inner loop iterates m
times. Thus, in the worst case, altogether m(n-m+1) iterations are
performed.
Thus, the running time of the algorithm is O(m(n-m+1)), which is
simplified to O(mn) (the worst case is illustrated below)
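The bound is tight: with T = aa...a and P = aa...a, every shift matches all m characters. A quick count of comparisons (a hypothetical helper, not from the slides):

def comparisons(T, P):
    n, m, total = len(T), len(P), 0
    for s in range(n - m + 1):
        for j in range(m):           # compare until mismatch or full match
            total += 1
            if T[s + j] != P[j]:
                break
    return total

print(comparisons('a' * 10, 'a' * 3))   # m(n-m+1) = 3 * 8 = 24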