Elementary Algorithms

Contents

Part I: Preface
    0.1 Why?
    0.2 The smallest free ID problem, the power of algorithms
        0.2.1 Improvement 1
        0.2.2 Improvement 2, Divide and Conquer
        0.2.3 Expressiveness vs. Performance
    0.3 The number puzzle, power of data structure
        0.3.1 The brute-force solution
        0.3.2 Improvement 1
        0.3.3 Improvement 2
    0.4 Notes and short summary
    0.5 Structure of the contents
    0.6 Appendix

Part II: Trees
    1 Binary search tree, the 'hello world' data structure
    2 The evolution of insertion sort
    3 Red-black tree, not so complex as it was thought
    4 AVL tree
        4.1 Introduction
            4.1.1 How to measure the balance of a tree?
        4.2 Definition of AVL tree
        4.3 Insertion
            4.3.1 Balancing adjustment
            4.3.2 Pattern Matching
        4.4 Deletion
        4.5 Imperative AVL tree algorithm
        4.6 Chapter note
    5 ...
    6 Suffix Tree
        6.1 Introduction
        6.2 Suffix trie
            6.2.1 Node transfer and suffix link
            6.2.2 On-line construction
        6.3 Suffix Tree
            6.3.1 On-line construction
        6.4 Suffix tree applications
            6.4.1 String/Pattern searching
            6.4.2 Find the longest repeated sub-string
            6.4.3 Find the longest common sub-string
            6.4.4 Find the longest palindrome
            6.4.5 Others
        6.5 Notes and short summary
    7 B-Trees
        7.1 Introduction
        7.2 Insertion
            7.2.1 Splitting
        7.3 Deletion
            7.3.1 Merge before delete method
            7.3.2 Delete and fix method
        7.4 Searching
        7.5 Notes and short summary

Part III: Heaps
    8 Binary Heaps
        8.1 Introduction
        8.2 Implicit binary heap by array
            8.2.1 Definition
            8.2.2 Heapify
            8.2.3 Build a heap
            8.2.4 Basic heap operations
            8.2.5 Heap sort
        8.3 Leftist heap and Skew heap, the explicit binary heaps
            8.3.1 Definition
            8.3.2 Merge
            8.3.3 Basic heap operations
            8.3.4 Heap sort by Leftist Heap
            8.3.5 Skew heaps
        8.4 Splay heap
            8.4.1 Definition
            8.4.2 Heap sort
        8.5 Notes and short summary
    9 ...

Part IV: ...

    14 Searching
        14.1 Introduction
        14.2 Sequence search
            14.2.1 Divide and conquer search
            14.2.2 Information reuse
        14.3 Solution searching
            14.3.1 DFS and BFS
            14.3.2 Search the optimal solution
        14.4 Short summary

Part VI: Appendix
    A Lists
        A.1 Introduction
        A.2 List Definition
            A.2.1 Empty list
            A.2.2 Access the element and the sub list
        A.3 Basic list manipulation
            A.3.1 Construction
            A.3.2 Empty testing and length calculating
            A.3.3 Indexing
            A.3.4 Access the last element
            A.3.5 Reverse indexing
            A.3.6 Mutating
            A.3.7 Sum and product
            A.3.8 Maximum and minimum
        A.4 Transformation
            A.4.1 Mapping and for-each
            A.4.2 Reverse
        A.5 Extract sub-lists
            A.5.1 Take, drop, and split-at
            A.5.2 Breaking and grouping
        A.6 Folding
            A.6.1 Folding from right
            A.6.2 Folding from left
            A.6.3 Folding in practice
        A.7 Searching and matching
            A.7.1 Existence testing
            A.7.2 Looking up
            A.7.3 Finding and filtering
            A.7.4 Matching
        A.8 Zipping and unzipping
        A.9 Notes and short summary
Part I

Preface
0.1 Why?
Are algorithms useful? Some programmers say that they seldom use any serious data structures or algorithms in real work, such as commercial application development, and that even when they need them, libraries already provide implementations. For example, the C++ Standard Template Library (STL) provides sort and selection algorithms as well as vector, queue, and set data structures. It seems that knowing how to use a library as a tool is quite enough.
Instead of answering this question directly, I would like to say that algorithms and data structures are critical for solving interesting problems, whether or not the problems themselves are practically useful.

Let's start with two problems that look like they can be solved in a brute-force way even by a fresh programmer.
0.2 The smallest free ID problem, the power of algorithms

This problem is discussed in Chapter 1 of Richard Bird's book [1]. It's common for applications and systems to use IDs (identifiers) to manage objects and entities. At any time, some IDs are in use and some are available. When a client tries to acquire a new ID, we always want to allocate the smallest available one. Suppose IDs are non-negative integers and that all IDs in use are kept in a list (or an array) which is not ordered. For example:
[18, 4, 8, 9, 16, 1, 14, 7, 19, 3, 0, 5, 2, 11, 6]
How can you find the smallest free ID, which is 10, from the list?
It seems the solution is quite easy even without any serious algorithms.
1: function Min-Free(A)
2:     x ← 0
3:     loop
4:         if x ∉ A then
5:             return x
6:         else
7:             x ← x + 1
where the '∉' membership test is realized as below:

1: function '∉'(x, X)
2:     for i ← 1 to |X| do
3:         if x = X[i] then
4:             return False
5:     return True
Some languages provide handy tools which wrap this linear time process. For
example in Python, this algorithm can be directly translated as the following.
def brute_force(lst):
    i = 0
    while True:
        if i not in lst:
            return i
        i = i + 1
0.2.1 Improvement 1

The key idea to improve the solution is based on the following fact: for a series of n numbers x1, x2, ..., xn, if there are free numbers, then some of the xi lie outside the range [0, n); otherwise the list is exactly a permutation of 0, 1, ..., n − 1, and n should be returned as the minimum free number. It means that max(xi) ≤ n − 1 in the latter case. And we have the following fact:

    minfree(x1, x2, ..., xn) ≤ n    (1)
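One way to exploit this fact is to keep a flag array of n + 1 booleans, the O(n) "check list" the next section refers to. A minimal sketch in Python (the function name is an assumption): mark every number that is at most n as used, then scan for the first unmarked slot.

def min_free_flags(lst):
    n = len(lst)
    flags = [False] * (n + 1)    # flags[i]: is number i in use?
    for x in lst:
        if x <= n:               # by fact (1), numbers beyond n never matter
            flags[x] = True
    for i in range(n + 1):
        if not flags[i]:
            return i             # the first unmarked number is the answer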
0.2.2 Improvement 2, Divide and Conquer

Although the above improvement is much faster, it costs O(n) extra space to keep the check list. If n is a huge number, this means a huge amount of space is wasted.
The typical divide and conquer strategy is to break the problem into some
smaller ones, and solve these to get the final answer.
We can put all numbers xi ≤ ⌊n/2⌋ into a sub-list A′ and put all the others into a second sub-list A″. Based on formula (1), if the length of A′ is exactly ⌊n/2⌋, the first half of the numbers is 'full', which indicates that the minimum free number must be in A″, so we'll need to recursively search in the shorter list A″. Otherwise, the minimum free number is located in A′, which again leads to a smaller problem.

When we search for the minimum free number in A″, the conditions change a little: we are no longer searching for the smallest free number starting from 0, but actually from ⌊n/2⌋ + 1 as the lower bound. So the algorithm takes the form minfree(A, l, u), where l is the lower bound and u is the upper bound index of the elements.

Note the trivial case: if the number list is empty, we merely return the lower bound as the result.
This divide and conquer solution can be formally expressed as a function:

    minfree(A) = search(A, 0, |A| − 1)

    search(A, l, u) =
        l : A = ∅
        search(A″, m + 1, u) : |A′| = m − l + 1
        search(A′, l, m) : otherwise

where

    m = ⌊(l + u)/2⌋
    A′ = {x ∈ A | x ≤ m}
    A″ = {x ∈ A | x > m}
It is obvious that this algorithm doesn't need any extra space. Each call performs O(|A|) comparisons to build A′ and A″; after that, the problem scale halves. So the time needed for this algorithm is T(n) = T(n/2) + O(n), which reduces to O(n). Another way to analyze the performance is by observing that the first call takes O(n) to build A′ and A″, the second call takes O(n/2), the third O(n/4), and so on. The total time is O(n + n/2 + n/4 + ...) = O(2n) = O(n).
In functional programming languages such as Haskell, partitioning a list is already provided by the standard library, and this algorithm can be translated as the following.

import Data.List (partition)

minFree xs = bsearch xs 0 (length xs - 1)

bsearch xs l u | xs == [] = l
               | length as == m - l + 1 = bsearch bs (m + 1) u
               | otherwise = bsearch as l m
    where
        m = (l + u) `div` 2
        (as, bs) = partition (<= m) xs
0.2.3 Expressiveness vs. Performance

The same divide and conquer idea can also be realized imperatively, in place: a quick-sort-like pass re-arranges the array so that all the elements before left are less than or equal to m, while those between left and right are greater than m. This is shown in figure 1.

Figure 1: Divide the array: all x[i] ≤ m where 0 ≤ i < left; all x[i] > m where left ≤ i < right; the remaining elements are not yet examined.
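A minimal sketch of this in-place approach in Python (not the C program the text refers to; the variable names are assumptions). The in-place swaps replace the sub-list construction, so no extra space is needed:

def min_free(nums):
    left, upper = 0, len(nums)        # active slice nums[left:upper]
    l, u = 0, len(nums) - 1           # the answer lies within [l, u + 1]
    while left < upper:
        m = (l + u) // 2
        right = left
        for i in range(left, upper):  # partition: move values <= m to the front
            if nums[i] <= m:
                nums[i], nums[right] = nums[right], nums[i]
                right += 1
        if right - left == m - l + 1:
            left, l = right, m + 1    # values l..m are all present; search the rest
        else:
            upper, u = right, m       # some value in [l, m] is missing
    return l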
This program is fast, and it doesn't need extra stack space³. However, compared to the previous Haskell program, it's harder to read, and the expressiveness has decreased. We have to balance performance and expressiveness.

³ This is done automatically in most functional languages, since our function is in tail-recursive form, which lends itself perfectly to this transformation.
0.3 The number puzzle, power of data structure

If the first problem, finding the minimum free number, is somewhat useful in practice, this problem is a pure one for fun. The puzzle is to find the 1,500th number which contains only 2, 3, or 5 as factors. The first 3 such numbers are, of course, 2, 3, and 5. The number 60 = 2²3¹5¹ is also such a number; it is the 25th one. The number 21 = 2⁰3¹7¹ isn't valid, because it contains a factor 7. The first 10 such numbers are listed as the following:

2, 3, 4, 5, 6, 8, 9, 10, 12, 15

If we consider 1 = 2⁰3⁰5⁰, then 1 is also a valid number, and it is the first one.
0.3.1 The brute-force solution

It seems the solution is quite easy even without any serious algorithms: check all numbers from 1 upward, extracting all factors of 2, 3, and 5 to see whether the remaining part is 1.
1: function Get-Number(n)
2:     x ← 1
3:     i ← 0
4:     loop
5:         if Valid?(x) then
6:             i ← i + 1
7:             if i = n then
8:                 return x
9:         x ← x + 1

10: function Valid?(x)
11:     while x mod 2 = 0 do
12:         x ← x/2
13:     while x mod 3 = 0 do
14:         x ← x/3
15:     while x mod 5 = 0 do
16:         x ← x/5
17:     if x = 1 then
18:         return True
19:     else
20:         return False

This brute-force algorithm works for most small n. However, to find the 1,500th number (which is 859963392), the C program based on this algorithm takes 40.39 seconds on my computer. I had to kill the program after 10 minutes when I increased n to 15,000.
0.3.2 Improvement 1

Analysis of the above algorithm shows that the modulo and division calculations are very expensive [2], and they are executed many times in loops. Instead of checking whether a number contains only 2, 3, or 5 as factors, an alternative solution is to construct such numbers from these factors.
[Figure: constructing candidates by multiplication: from 1 we get 1×2=2, 1×3=3, 1×5=5; from 2: 2×2=4, 2×3=6, 2×5=10; from 3: 3×2=6, 3×3=9, 3×5=15; from 4: 4×2=8, 4×3=12, 4×5=20; ...]
We start from 1 and maintain a queue of candidate numbers in increasing order. Each time, we extract the smallest number x from the queue and insert 2x, 3x, and 5x back, keeping the queue ordered and skipping elements that are already present:

1: function Get-Number(n)
2:     Q ← {1}
3:     while n > 0 do
4:         x ← Dequeue(Q)
5:         n ← n − 1
6:         Insert(Q, 2x)
7:         Insert(Q, 3x)
8:         Insert(Q, 5x)
9:     return x

10: function Insert(Q, x)
11:     i ← 1
12:     while i ≤ |Q| and Q[i] < x do
13:         i ← i + 1
14:     if i ≤ |Q| and x = Q[i] then
15:         return    ▷ skip duplicated elements
16:     insert x into Q at position i
The Insert function takes O(|Q|) time to find the proper position and insert the element; if the element already exists, it just returns. A rough estimation tells us that the length of the queue increases in proportion to n (each time we extract one element and push at most 3 new ones, an increase ratio of about 2), so the total running time is O(1 + 2 + 3 + ... + n) = O(n²).

Figure 3 shows the number of queue accesses plotted against n. It is a quadratic curve, which reflects the O(n²) performance.
The C program based on this algorithm takes only 0.016s to get the right answer, 859963392, which is about 2,500 times faster than the brute-force solution.
Improvement 1 can also be considered in a recursive way. Suppose X is the infinite series of all numbers which only contain factors of 2, 3, or 5. The following formula shows an interesting relationship:

    X = {1} ∪ {2x : x ∈ X} ∪ {3x : x ∈ X} ∪ {5x : x ∈ X}    (2)
where we can define ∪ in a special form, so that all elements are stored in order and are unique. Suppose that X = {x1, x2, x3, ...}, Y = {y1, y2, y3, ...}, X′ = {x2, x3, ...} and Y′ = {y2, y3, ...}. We have

    X ∪ Y =
        X : Y = ∅
        Y : X = ∅
        {x1, X′ ∪ Y} : x1 < y1
        {x1, X′ ∪ Y′} : x1 = y1
        {y1, X ∪ Y′} : x1 > y1
In a functional programming language such as Haskell, which supports lazy evaluation, the above infinite series can be translated into the following program.

ns = 1 : merge (map (*2) ns) (merge (map (*3) ns) (map (*5) ns))

merge [] l = l
merge l [] = l
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
                    | x == y = x : merge xs ys
                    | otherwise = y : merge (x:xs) ys
0.3.3 Improvement 2

Considering the above solution, although it is much faster than the brute-force one, it still has some drawbacks. It produces many duplicated numbers, which are eventually dropped when examining the queue. Secondly, it performs a linear scan and insertion to keep the order of all elements in the queue, which degrades the ENQUEUE operation from O(1) to O(|Q|).
If we use three queues instead of only one, we can improve the solution one step further. Denote these queues as Q2, Q3, and Q5, and initialize them as Q2 = {2}, Q3 = {3} and Q5 = {5}. Each time we DEQUEUE the smallest element x among the heads of Q2, Q3, and Q5, and do the following test:

- If x comes from Q2, we ENQUEUE 2x, 3x, and 5x back to Q2, Q3, and Q5 respectively;
- If x comes from Q3, we only need to ENQUEUE 3x to Q3 and 5x to Q5; we needn't ENQUEUE 2x to Q2, because 2x already exists in Q3;
- If x comes from Q5, we only need to ENQUEUE 5x to Q5; there is no need to ENQUEUE 2x and 3x to Q2 and Q3, because they are already in the queues.
We repeatedly extract the smallest head element until we find the n-th number. The algorithm based on this idea is implemented as below.

1: function Get-Number(n)
2:     if n = 1 then
3:         return 1
4:     else
5:         Q2 ← {2}
6:         Q3 ← {3}
7:         Q5 ← {5}
8:         while n > 1 do
9:             x ← min(Head(Q2), Head(Q3), Head(Q5))
10:            if x = Head(Q2) then
11:                Dequeue(Q2)
12:                Enqueue(Q2, 2x)
13:                Enqueue(Q3, 3x)
14:                Enqueue(Q5, 5x)
15:            else if x = Head(Q3) then
16:                Dequeue(Q3)
17:                Enqueue(Q3, 3x)
18:                Enqueue(Q5, 5x)
19:            else
20:                Dequeue(Q5)
21:                Enqueue(Q5, 5x)
22:            n ← n − 1
23:        return x

[Figure: the first steps of the three-queue process. min=2 produces 2×2=4, 3×2=6, 5×2=10; min=3 produces 3×3=9, 5×3=15; min=4 produces 2×4=8, 3×4=12, 5×4=20; min=5 produces 5×5=25; ...]
This algorithm loops n times, and within each loop it extracts one head element from the three queues, which takes constant time. Then it appends one to three new elements to the ends of the queues, which is bounded by constant time too. So the total time of the algorithm is bounded by O(n). The C++ program translated from this algorithm takes less than 1 s to produce the 1,500th number, 859963392.
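A rough Python sketch of the same three-queue process (using collections.deque from the standard library; this is not the author's C++ listing, and the names are mine):

from collections import deque

def get_number(n):
    if n == 1:
        return 1
    q2, q3, q5 = deque([2]), deque([3]), deque([5])
    x = 1
    while n > 1:
        x = min(q2[0], q3[0], q5[0])
        if x == q2[0]:
            q2.popleft()
            q2.append(2 * x); q3.append(3 * x); q5.append(5 * x)
        elif x == q3[0]:
            q3.popleft()
            q3.append(3 * x); q5.append(5 * x)
        else:
            q5.popleft()
            q5.append(5 * x)
        n -= 1
    return x

print(get_number(1500))   # expect 859963392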
This solution can also be implemented in a functional way. We define a function take(n), which returns the first n numbers containing only the factors 2, 3, or 5.
    take(n) = f(n, {1}, {2}, {3}, {5})

where

    f(n, X, Q2, Q3, Q5) =
        X : n = 1
        f(n − 1, X ∪ {x}, Q2′, Q3′, Q5′) : otherwise

Here x is the smallest among the heads of the three queues, and the updated queues Q2′, Q3′, Q5′ are obtained exactly as in the imperative version, depending on whether x came from Q2, Q3, or Q5. Invoking last (takeN 1500) generates the correct answer, 859963392.
0.4 Notes and short summary

Reviewing the two puzzles, we find that in both cases the brute-force solutions are weak: in the first problem, the brute-force approach deals poorly with a long ID list, while in the second problem it doesn't work at all. The first problem shows the power of algorithms, while the second tells why data structures are important. There are plenty of interesting problems which were hard to solve before computers were invented; with the aid of computers and programming, we are able to find answers in a quite different way. Compared with what we learned in mathematics courses at school, we haven't been taught methods like this.
While there are already a lot of wonderful books about algorithms, data structures, and math, few of them compare the procedural solutions with the functional ones. From the above discussion, it can be seen that the functional solutions are sometimes very expressive and close to what we are familiar with in mathematics.

This series of posts focuses on providing both imperative and functional algorithms and data structures. Many functional data structures can be referenced from Okasaki's book [3], while the imperative ones can be found in classic textbooks [4] or even on Wikipedia. Multiple programming languages, including C, C++, Python, Haskell, and Scheme/Lisp, will be used. In order to make it easy to read for programmers with different backgrounds, pseudo-code and mathematical functions are the regular descriptions in each post.
The author is NOT a native English speaker; the reason why this book is only available in English for the time being is that the contents are still changing frequently. Any feedback, comments, or criticism is welcome.
0.5 Structure of the contents

In the following series of posts, I'll first introduce elementary data structures before algorithms, because many algorithms need knowledge of data structures as a prerequisite.
The 'hello world' data structure, the binary search tree, is the first topic. Then we introduce how to solve the balance problem of the binary search tree. After that, I'll show other interesting trees: Trie, Patricia, and suffix trees are useful in text manipulation, while B-trees are commonly used in file system and database implementations.

The second part about data structures covers heaps. We'll provide a general heap definition and introduce binary heaps by array and by explicit binary trees. Then we'll extend to K-ary heaps, including binomial heaps, Fibonacci heaps, and pairing heaps.

Arrays and queues are typically considered among the easiest data structures; however, we'll show how difficult it is to implement them in the third part.

As the elementary sorting algorithms, we'll introduce insertion sort, quick sort, merge sort, etc., in both imperative and functional ways.

The final part is about searching; besides element searching, we'll also show string-matching algorithms such as KMP.

All the posts are provided under the GNU FDL (Free Documentation License), and the programs are under the GNU GPL.
0.6 Appendix

All programs provided along with this book are free for downloading. Download position: [Link]
Bibliography

[1] Richard Bird. Pearls of Functional Algorithm Design. Cambridge University Press; 1st edition (November 1, 2010). ISBN-10: 0521513383

[2] Jon Bentley. Programming Pearls (2nd Edition). Addison-Wesley Professional; 2nd edition (October 7, 1999). ISBN-13: 978-0201657883

[3] Chris Okasaki. Purely Functional Data Structures. Cambridge University Press (July 1, 1999). ISBN-13: 978-0521663502

[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001. ISBN: 0262032937
Part II

Trees
Chapter 1
Binary search tree, the 'hello world' data structure

1.1 Introduction

Arrays and lists are typically considered the 'hello world' data structures. However, we'll see that they are not actually particularly easy to implement. In some procedural settings, arrays are the most elementary data structures, and it is possible to implement linked lists using arrays (see section 10.3 in [1]). On the other hand, in some functional settings, linked lists are the elementary building blocks used to create arrays and other data structures.

Considering these factors, we start with binary search trees (BSTs) as the 'hello world' data structure, using an interesting problem Jon Bentley mentioned in Programming Pearls [2]: count the number of times each word occurs in a large text. One solution in C++ is below:
#include <iostream>
#include <map>
#include <string>
using namespace std;

int main(int, char**){
    map<string, int> dict;
    string s;
    while(cin>>s)
        ++dict[s];
    map<string, int>::iterator it=dict.begin();
    for(; it!=dict.end(); ++it)
        cout<<it->first<<": "<<it->second<<"\n";
}
And we can run it to produce the word-count result by feeding a text file to the program's standard input with UNIX pipe commands.

Viewed recursively, a binary search tree is either empty, or a node of the form (left, key, right), where left and right are themselves binary search trees, all keys in left are less than key, and key is less than all keys in right.

[Figure 1.2: an example binary search tree; its keys are 1, 2, 3, 4, 7, 8, 9, 10, 14, 16.]
1.2 Data Layout

Based on the recursive definition of BSTs, we can draw the data layout in a procedural setting with pointers, as in figure 1.3.

The node contains a key field, which can be augmented with satellite data, a field holding a pointer to the left child, and a field pointing to the right child. In order to back-track to an ancestor easily, a parent field can be provided as well. In this post we'll ignore the satellite data for simple illustration purposes. Based on this layout, the node of a binary search tree can be defined in a procedural language, such as C++, as the following.
template<class T>
struct node{
    node(T x):key(x), left(0), right(0), parent(0){}
    ~node(){
        delete left;
        delete right;
    }

    node* left;
    node* right;
    node* parent; //parent is optional; it's helpful for succ/pred
    T key;
};
There is another setting, for instance in Scheme/Lisp languages, where the elementary data structure is the linked list. Figure 1.4 shows how a binary search tree node can be built on top of linked lists.

Figure 1.4: Binary search tree node layout on top of a linked list, where 'left ...' and 'right ...' are either empty or binary search tree nodes composed in the same way.
Because in a purely functional setting it's hard to use pointers to back-track to ancestors (and typically there is no need to back-track, since we can provide a top-down solution in a recursive way), there is no parent field in such a layout.

For simplicity, we'll skip the detailed layout in the future and only focus on the logical layout of data structures. For example, below is the definition of a binary search tree node in Haskell.
data Tree a = Empty
| Node (Tree a) a (Tree a)
1.3 Insertion

To insert a key k (possibly along with a value, in practice) into a binary search tree T, we can follow a quite straightforward way:

- If the tree is empty, construct a leaf node with key = k;
- If k is less than the key of the root node, insert it into the left child;
- If k is greater than the key of the root, insert it into the right child.

There is an exceptional case: if k is equal to the key of the root, it already exists; we can either overwrite the data or just do nothing. For simplicity, this case is skipped in this post.
This algorithm is described recursively; its simplicity is why we consider the binary search tree a 'hello world' data structure. Formally, the algorithm can be represented with a recursive function:

    insert(T, k) =
        node(∅, k, ∅) : T = ∅
        node(insert(L, k), Key, R) : k < Key
        node(L, Key, insert(R, k)) : otherwise    (1.1)
where

    L = left(T), R = right(T), Key = key(T)

The node function creates a new node given a left sub-tree, a key, and a right sub-tree as parameters. ∅ means NIL or Empty. The functions left, right, and key are accessors which retrieve the left sub-tree, the right sub-tree, and the key of a node.
Translating the above function directly into Haskell yields the following program.

insert::(Ord a) => Tree a -> a -> Tree a
insert Empty k = Node Empty k Empty
insert (Node l x r) k | k < x = Node (insert l k) x r
                      | otherwise = Node l x (insert r k)
This program utilizes the pattern-matching features provided by the language. However, even in functional settings without this feature, for instance Scheme/Lisp, the program is still expressive.
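For comparison, a minimal imperative sketch in Python (assuming the node class from section 1.2 with key, left, right, and parent fields; the function name is mine):

def tree_insert(t, node):
    root, parent = t, None
    while t is not None:                 # walk down to the insertion point
        parent = t
        t = t.left if node.key < t.key else t.right
    node.parent = parent
    if parent is None:                   # the tree was empty
        return node
    if node.key < parent.key:
        parent.left = node
    else:
        parent.right = node
    return root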
1.4 Traversing

Traversing means visiting every element one by one in a binary search tree. There are 3 ways to traverse a binary tree: pre-order tree walk, in-order tree walk, and post-order tree walk. The names of these traversing methods highlight when we visit the root of a binary search tree.
A tree has three parts: the left child; the root, which contains the key and satellite data; and the right child. If we denote them as (left, current, right), the three traversing methods are defined as the following:

- pre-order traverse: visit current, then left, finally right;
- in-order traverse: visit left, then current, finally right;
- post-order traverse: visit left, then right, finally current.

Each visit is recursive, and the position in which current is visited determines the name of the traversing method.
For the binary search tree shown in figure 1.2, below are the three different
traversing results.
pre-order traverse result: 4, 3, 1, 2, 8, 7, 16, 10, 9, 14;
in-order traverse result: 1, 2, 3, 4, 7, 8, 9, 10, 14, 16;
post-order traverse result: 2, 1, 3, 7, 9, 14, 10, 16, 8, 4;
It can be found that the in-order walk of a binary search tree outputs the elements in increasing order, which is particularly helpful. The definition of the binary search tree ensures this interesting property; the proof of this fact is left as an exercise in this post.
The in-order tree walk algorithm can be described as the following: if the tree is empty, just return; otherwise, traverse the left child by in-order walk, then access the key, and finally traverse the right child by in-order walk.
Translating the above description yields a generic map function:

    map(f, T) =
        ∅ : T = ∅
        node(l′, k′, r′) : otherwise    (1.2)

where

    l′ = map(f, left(T))
    r′ = map(f, right(T))
    k′ = f(key(T))
If we only need to access the keys without creating the transformed tree, we can realize this algorithm in a procedural way, like the below C++ program.

template<class T, class F>
void in_order_walk(node<T>* t, F f){
    if(t){
        in_order_walk(t->left, f);
        f(t->value);
        in_order_walk(t->right, f);
    }
}
We can also define a function toList, which turns a binary search tree into a sorted list:

    toList(T) =
        ∅ : T = ∅
        toList(left(T)) ∪ {key(T)} ∪ toList(right(T)) : otherwise    (1.3)

Below is the Haskell program based on this definition.

toList Empty = []
toList (Node l k r) = toList l ++ [k] ++ toList r

This gives us a method to sort a list of elements, so-called 'tree sort': build a binary search tree from the list, then turn the tree back into a list. The tree can be built by folding insert over the list:

    fromList(X) = fold-left(insert, ∅, X)    (1.4)

For readers who are not familiar with folding from the left, this function can also be defined recursively as the following:

    fromList(X) =
        ∅ : X = ∅
        insert(fromList({x2, x3, ..., xn}), x1) : otherwise    (1.5)

We'll make intensive use of the folding function, as well as function composition and partial evaluation, in the future; please refer to the appendix of this book or [6], [7], and [8] for more information.
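A quick illustration of equations (1.3), (1.4), and (1.5), sketched in Python with plain tuples (left, key, right) and None for ∅ (the names are mine):

def insert_t(t, k):
    if t is None:
        return (None, k, None)
    l, key, r = t
    if k < key:
        return (insert_t(l, k), key, r)
    return (l, key, insert_t(r, k))

def to_list(t):
    if t is None:
        return []
    l, key, r = t
    return to_list(l) + [key] + to_list(r)   # in-order traversal, eq. (1.3)

def tree_sort(xs):
    t = None
    for x in xs:                             # fold insert over the list, eq. (1.4)
        t = insert_t(t, x)
    return to_list(t)

print(tree_sort([4, 3, 1, 2, 8, 7, 16, 10, 9, 14]))   # prints the sorted list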
Exercise 1.1

- Given the in-order traverse result and the pre-order traverse result, can you reconstruct the tree and figure out the post-order traverse result? Pre-order result: 1, 2, 4, 3, 5, 6; in-order result: 4, 2, 1, 5, 3, 6; post-order result: ?
- Write a program in your favorite language to reconstruct the binary tree from the pre-order result and the in-order result.
- Prove that the in-order walk outputs the elements stored in a binary search tree in increasing order.
- Can you analyze the performance of tree sort with big-O notation?
1.5 Querying a binary search tree

There are three types of queries for a binary search tree: searching for a key in the tree, finding the minimum or maximum element in the tree, and finding the predecessor or successor of an element in the tree.
1.5.1 Looking up

According to the definition of the binary search tree, searching for a key in a tree can be realized as the following:

- If the tree is empty, the search fails;
- If the key of the root is equal to the value being searched for, the search succeeds, and the root is returned as the result;
- If the value is less than the key of the root, search in the left child;
- Otherwise, which means the value is greater than the key of the root, search in the right child.
This algorithm can be described with a recursive function as below:

    lookup(T, x) =
        ∅ : T = ∅
        T : key(T) = x
        lookup(left(T), x) : x < key(T)
        lookup(right(T), x) : otherwise    (1.6)
In real applications, we may return the satellite data instead of the node as the search result. This algorithm is simple and straightforward. Here is a translation into a Haskell program.

lookup::(Ord a) => Tree a -> a -> Tree a
lookup Empty _ = Empty
lookup t@(Node l k r) x | k == x = t
                        | x < k = lookup l x
                        | otherwise = lookup r x
If the binary search tree is well balanced, which means that almost all nodes have both a non-NIL left child and right child, then for N elements the search algorithm takes O(lg N) time. This is not a formal definition of balance; we'll give one in a later post about red-black trees. If the tree is poorly balanced, the worst case takes O(N) time to search for a key. If we denote the height of the tree as h, we can express the performance of the algorithm uniformly as O(h).
The search algorithm can also be realized without recursion, in a procedural manner:

1: function Search(T, x)
2:     while T ≠ NIL and Key(T) ≠ x do
3:         if x < Key(T) then
4:             T ← Left(T)
5:         else
6:             T ← Right(T)
7:     return T

Below is the C++ program based on this algorithm.

template<class T>
node<T>* search(node<T>* t, T x){
    while(t && t->key != x){
        if(x < t->key) t = t->left;
        else t = t->right;
    }
    return t;
}
1.5.2 Minimum and maximum

Minimum and maximum can be implemented using the property of the binary search tree: lesser keys are always in the left child, and greater keys are always in the right. For the minimum, we keep traversing the left sub-tree until it is empty, while for the maximum we traverse the right:

    min(T) =
        key(T) : left(T) = ∅
        min(left(T)) : otherwise    (1.7)

    max(T) =
        key(T) : right(T) = ∅
        max(right(T)) : otherwise    (1.8)

Both functions are bound to O(h) time, where h is the height of the tree. For a balanced binary search tree, min/max are bound to O(lg N) time, while they are O(N) in the worst case.

We skip translating them to programs here. It's also possible to implement them in a purely procedural way without recursion.
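Since the Python succ/pred programs in the next sub-section call tree_min and tree_max, here is an iterative sketch of these helpers (the names are assumptions), following directly from (1.7) and (1.8):

def tree_min(t):
    while t.left is not None:    # lesser keys are always to the left
        t = t.left
    return t

def tree_max(t):
    while t.right is not None:   # greater keys are always to the right
        t = t.right
    return t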
1.5.3 Successor and predecessor

We'll meet many problems in this series of posts which are only valid in imperative settings and are not meaningful in functional settings at all. One good example is how to delete an element in a red-black tree [3].

In this section, we'll only present the imperative algorithms for finding the successor and predecessor in a binary search tree.

When finding the successor of an element x, which is the smallest element y satisfying y > x, there are two cases. If the node with value x has a non-NIL right child, the minimum element of the right child is the answer. For example, in figure 1.2, to find the successor of 8, we search its right sub-tree for the minimum one, which yields 9 as the result. If node x doesn't have a right child, we need to back-track to find the closest ancestor whose left child is also an ancestor of x. In figure 1.2, since 2 doesn't have a right sub-tree, we go back to its parent, 1. However, 2 is the right child of 1, so we go back again and reach node 3; the left child of 3 is also an ancestor of 2, thus 3 is the successor of node 2.
Based on this description, the algorithm can be given as the following.

1: function Succ(x)
2:     if Right(x) ≠ NIL then
3:         return Min(Right(x))
4:     else
5:         p ← Parent(x)
6:         while p ≠ NIL and x = Right(p) do
7:             x ← p
8:             p ← Parent(p)
9:         return p

The predecessor case is quite similar to the successor algorithm; they are symmetrical to each other.

1: function Pred(x)
2:     if Left(x) ≠ NIL then
3:         return Max(Left(x))
4:     else
5:         p ← Parent(x)
6:         while p ≠ NIL and x = Left(p) do
7:             x ← p
8:             p ← Parent(p)
9:         return p
Below are the Python programs based on these algorithms. The while-loop conditions are changed a bit: they test that x is not the other child, which is equivalent when x is a child of p.

def succ(x):
    if x.right is not None: return tree_min(x.right)
    p = x.parent
    while p is not None and p.left != x:
        x = p
        p = p.parent
    return p

def pred(x):
    if x.left is not None: return tree_max(x.left)
    p = x.parent
    while p is not None and p.right != x:
        x = p
        p = p.parent
    return p
Exercise 1.2

- Can you figure out how to iterate over a tree as a generic container by using pred()/succ()? What's the performance of such a traversing process in terms of big-O?
- A reader discussed traversing all elements inside a range [a, b]. In C++, the algorithm looks like the below code:

  for_each(m.lower_bound(12), m.upper_bound(26), f);

  Can you provide the purely functional solution for this problem?
1.6 Deletion

Deletion is another imperative-only topic for binary search trees. This is because deletion mutates the tree, while in purely functional settings we don't modify the tree after building it in most applications. However, one method of deleting an element from a binary search tree in a purely functional way is shown in this section; it actually reconstructs the tree rather than modifying it.

Deletion is the most complex operation for binary search trees, because we must keep the BST property: for any node, all keys in the left sub-tree are less than the key of this node, which in turn is less than any key in the right sub-tree. Deleting a node can break this property.

In this post, differing from the algorithm described in [1], a simpler one from the SGI STL implementation is used [4].
To delete a node x from a tree:

- If x has no child or only one child, splice x out;
- Otherwise (x has two children), use the minimum element of its right sub-tree to replace x, and splice that original minimum element out.

The simplicity comes from the fact that the minimum element is stored in a node of the right sub-tree which can't have two non-NIL children, so it ends up in the trivial case: that node can be directly spliced out from the tree.

Figures 1.5, 1.6, and 1.7 illustrate these different cases when deleting a node from the tree.
Based on this idea, the deletion can be defined as the below function:

    delete(T, x) =
        ∅ : T = ∅
        node(delete(L, x), K, R) : x < K
        node(L, K, delete(R, x)) : x > K
        R : x = K, L = ∅
        L : x = K, R = ∅
        node(L, y, delete(R, y)) : otherwise    (1.9)
[Figures 1.5, 1.6, 1.7: the deletion cases. A node with an empty left (or right) child is replaced by its other child; a node with two children gets the key min(R), with delete(R, min(R)) as its new right sub-tree.]
where

    L = left(T), R = right(T), K = key(T), y = min(R)

Translating the function to Haskell yields the below program.
delete::(Ord a) => Tree a -> a -> Tree a
delete Empty _ = Empty
delete (Node l k r) x | x < k = Node (delete l x) k r
                      | x > k = Node l k (delete r x)
                      -- x == k
                      | isEmpty l = r
                      | isEmpty r = l
                      | otherwise = Node l k' (delete r k')
    where k' = min r -- min returns the minimum key of a tree, cf. (1.7)
The function isEmpty is used to test whether a tree is empty (∅). Note that the algorithm first performs a search to locate the node to be deleted, and after that it executes the deletion. This algorithm takes O(h) time, where h is the height of the tree.

It's also possible to pass the node, rather than the element, to the algorithm for deletion, so that searching is no longer needed.
The imperative algorithm is more complex because it needs to set the parent pointers properly. The function returns the root of the resulting tree.

1: function Delete(T, x)
2:     root ← T
3:     x′ ← x    ▷ save x
4:     parent ← Parent(x)
5:     if Left(x) = NIL then
6:         x ← Right(x)
7:     else if Right(x) = NIL then
8:         x ← Left(x)
9:     else    ▷ both children are non-NIL
10:        y ← Min(Right(x))
11:        Key(x) ← Key(y)
12:        copy other satellite data from y to x
13:        if Parent(y) ≠ x then    ▷ y hasn't a left sub-tree
14:            Left(Parent(y)) ← Right(y)
15:        else    ▷ y is the root of x's right sub-tree
16:            Right(x) ← Right(y)
17:        Remove y
18:        return root
19:    if x ≠ NIL then
20:        Parent(x) ← parent
21:    if parent = NIL then    ▷ we are removing the root of the tree
22:        root ← x
23:    else
24:        if Left(parent) = x′ then
25:            Left(parent) ← x
26:        else
27:            Right(parent) ← x
28:    Remove x′
29:    return root
Here we assume the node to be deleted is not NIL (otherwise we can simply return the original tree). The algorithm first records the root of the tree and copies pointers to x and its parent.

If either of the children is empty, the algorithm just splices x out. If x has two non-NIL children, we first locate the minimum node y of its right child, replace the key of x with y's key, copy the satellite data as well, and then splice y out. Note that there is a special case in which y is the root node of x's right sub-tree.

Finally, we need to reset the stored parent if the original x had at most one non-NIL child. If the parent pointer we copied before is empty, it means we are deleting the root node, so we need to return the new root. After the parent is set properly, we finally remove the old x from memory.

The corresponding Python program for the deletion algorithm is given below. Because Python provides GC, we needn't explicitly remove the node from memory.
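A sketch of such a Python program, following the pseudocode step by step (assuming the node class from section 1.2 and the tree_min helper; the parent pointers of the spliced sub-trees are fixed up explicitly):

def tree_delete(t, x):
    if x is None:
        return t
    root, old_x, parent = t, x, x.parent
    if x.left is None:
        x = x.right                    # splice x out, lifting its right child
    elif x.right is None:
        x = x.left                     # splice x out, lifting its left child
    else:                              # both children are non-NIL
        y = tree_min(x.right)
        x.key = y.key                  # replace x's key by the minimum of the right side
        if y.parent != x:              # y has no left sub-tree
            y.parent.left = y.right
            if y.right is not None:
                y.right.parent = y.parent
        else:                          # y is the root of x's right sub-tree
            x.right = y.right
            if y.right is not None:
                y.right.parent = x
        return root
    if x is not None:
        x.parent = parent
    if parent is None:                 # we are deleting the root of the tree
        root = x
    elif parent.left == old_x:
        parent.left = x
    else:
        parent.right = x
    return root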
Exercise 1.3

- There is a symmetrical solution for deleting a node which has two non-NIL children: replace the element by splicing the maximum one out of the left sub-tree. Write a program to implement this solution.
1.7 Randomly build binary search tree

It can be found that all the operations given in this post are bound to O(h) time for a tree of height h. The height affects the performance a lot. For a very unbalanced tree, h tends to O(N), which leads to the worst case, while for a balanced tree h is close to O(lg N), and we gain good performance.

How to make the binary search tree balanced will be discussed in the next post. However, there exists a simple way: a binary search tree can be randomly built, as described in [1]. Random building can help to avoid (decrease the possibility of) an unbalanced binary tree. The idea is that, before building the tree, we call a random process to shuffle the elements.
Exercise 1.4

- Write a random building process for binary search trees.
1.8 Appendix

All programs are provided along with this post and are free for downloading. We provide C, C++, Python, Haskell, and Scheme/Lisp programs as examples.
Bibliography

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001. ISBN: 0262032937

[2] Jon Bentley. Programming Pearls (2nd Edition). Addison-Wesley Professional; 2nd edition (October 7, 1999). ISBN-13: 978-0201657883

[3] Chris Okasaki. Ten Years of Purely Functional Data Structures. [Link]

[4] SGI. Standard Template Library Programmer's Guide. [Link]
Chapter 2
The evolution of insertion sort

2.1 Introduction
Insertion sort is a straightforward sorting algorithm¹. Its idea comes from the way we sort a hand of playing cards: cards are picked up one by one, and each one is inserted into its proper position among the cards already at hand, which are kept sorted all the time. The algorithm can be described as the following:

function Sort(A)
    X ← ∅
    for each x ∈ A do
        Insert(X, x)
    return X

¹ Some readers may argue that bubble sort is the easiest sort algorithm. Bubble sort isn't covered in this book, as we don't think it's a valuable algorithm [1].
50
Note that in above algorithm, we store the sorted result in X, so this isnt
in-place sorting. Its easy to change it to in-place algorithm.
function Sort(A)
for i 2 to Length(A) do
insert Ai to sorted sequence {A01 , A02 , ..., A0i1 }
At any time, when we process the i-th element, all elements before i have
already been sorted. we continuously insert the current elements until consume
all the unsorted data. This idea is illustrated as in figure 9.3.
Figure 2.2: The left part is sorted data; elements are continuously inserted into the sorted part.
We can find a recursive concept in this definition; thus it can be expressed as the following:

    sort(A) =
        ∅ : A = ∅
        insert(sort({A2, A3, ...}), A1) : otherwise    (2.2)
2.2 Insertion

We haven't answered the question of how to realize insertion, however. It's a puzzle how humans locate the proper position so quickly.

For a computer, the obvious option is to perform a scan. We can either scan from left to right or vice versa. However, if the sequence is stored in a plain array, it's necessary to scan from right to left:

function Sort(A)
    for i ← 2 to Length(A) do    ▷ insert A[i] into the sorted sequence A[1...i−1]
        x ← A[i]
        j ← i − 1
        while j > 0 and x < A[j] do
            A[j + 1] ← A[j]
            j ← j − 1
        A[j + 1] ← x
One may think scanning from left to right is more natural. However, it isn't as effective as the above algorithm for a plain array. The reason is that it's expensive to insert an element at an arbitrary position in an array, since an array stores its elements contiguously. If we want to insert a new element x at position i, we must shift all elements after i, including those at i + 1, i + 2, ..., one position to the right. Only then is the cell at position i empty, and we can put x in it. This is illustrated in figure 2.3.
[Figure 2.3: inserting x at position i of an array: A[i], A[i+1], ..., A[n−1], A[n] are shifted one cell to the right; the last cell is initially empty.]
One can find some other equivalent programs, for instance the following ANSI C program. However, this version isn't as effective as the pseudo-code.

void isort(Key* xs, int n){
    int i, j;
    for(i = 1; i < n; ++i)
        for(j = i - 1; j >= 0 && xs[j+1] < xs[j]; --j)
            swap(xs, j, j+1);
}
This is because the swap function, which exchanges two elements, typically uses a temporary variable like the following:

void swap(Key* xs, int i, int j){
    Key temp = xs[i];
    xs[i] = xs[j];
    xs[j] = temp;
}
Exercise 2.1

- Provide an explicit insertion function, and call it from the general insertion sort algorithm. Please realize it in both a procedural way and a functional way.
2.3 Improvement 1

Let's go back to the question of why a human can find the proper position for insertion so quickly. We have shown a solution based on scanning. Note the fact that at any time, all the cards at hand are well sorted; another possible solution is to use binary search to find the location.

We'll explain search algorithms in another dedicated chapter; binary search is just briefly introduced here for illustration purposes. The algorithm is changed to call a binary search procedure:
function Sort(A)
    for i ← 2 to Length(A) do
        x ← A[i]
        p ← Binary-Search(A[1...i−1], x)
        for j ← i down to p + 1 do
            A[j] ← A[j − 1]
        A[p] ← x
Instead of scanning elements one by one, binary search utilizes the information that all elements in the array slice {A1, ..., Ai−1} are sorted. Let's assume the order is monotonically increasing. To find a position p that satisfies A[p−1] ≤ x ≤ A[p], we can first examine the middle element, for example A[⌊i/2⌋]. If x is less than it, we recursively perform binary search in the first half of the sequence; otherwise, we only need to search the last half.

Each time, we halve the number of elements to be examined, so this search process takes O(lg n) time to locate the insertion position.
function Binary-Search(A, x)
    l ← 1
    u ← 1 + Length(A)
    while l < u do
        m ← ⌊(l + u)/2⌋
        if A[m] = x then
            return m    ▷ a duplicated element is found
        else if A[m] < x then
            l ← m + 1
        else
            u ← m
    return l
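For reference, a compact Python rendition of this idea (a sketch, not from the book: bisect_left from the standard library performs the binary search, while the list insertion itself still shifts O(n) elements, which motivates the next section):

from bisect import bisect_left

def isort_bsearch(xs):
    for i in range(1, len(xs)):
        x = xs.pop(i)                  # take the i-th element out
        p = bisect_left(xs, x, 0, i)   # binary search in the sorted prefix
        xs.insert(p, x)                # O(n) shifting to make room
    return xs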
Exercise 2.2

- Write the binary search in a recursive manner. You needn't use a purely functional programming language.
2.4 Improvement 2

Although we improved the total comparison time to O(n lg n) in the previous section, the number of moves is still O(n²). The reason movement takes so long is that the sequence is stored in a plain array: an array is by nature a contiguously laid-out data structure, so the insertion operation is expensive. This hints that we can use a linked list to represent the sequence; it can improve the insertion itself to constant time.
    insert(A, x) =
        {x} : A = ∅
        {x} ∪ A : x < A1
        {A1} ∪ insert({A2, A3, ...}, x) : otherwise    (2.3)
And we can complete the insertion sort program based on equations (2.2) and (2.3):

isort [] = []
isort (x:xs) = insert (isort xs) x
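A sketch of the index-based linked-list approach (also referred to in Exercise 2.3 below), in Python: next_[i] holds the index of the element that follows xs[i], −1 terminates the list, and all names are mine.

def isort_linked(xs):
    n = len(xs)
    next_ = [-1] * n
    head = -1                   # index of the smallest element so far
    for i in range(n):
        if head == -1 or xs[i] < xs[head]:
            next_[i], head = head, i
        else:
            j = head            # linear scan to find the insertion point
            while next_[j] != -1 and xs[next_[j]] < xs[i]:
                j = next_[j]
            next_[i], next_[j] = next_[j], i
    result, i = [], head        # rebuild the sorted order by following links
    while i != -1:
        result.append(xs[i])
        i = next_[i]
    return result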
Although we changed the insertion operation to constant time by using a linked list, we still have to traverse the list to find the position, which results in O(n²) comparisons. This is because a linked list, unlike an array, doesn't support random access, which means we can't use binary search in the linked-list setting.
Exercise 2.3

- Complete the insertion sort using the linked-list insertion function in your favorite imperative programming language.
- The index-based linked list returns a sequence of re-arranged indexes as the result. Write a program to re-order the original array of elements from this result.
2.5 Final improvement by binary search tree

It seems that we have driven into a corner: we must improve both the comparison and the insertion at the same time, or we will end up with O(n²) performance.

We must use binary search; this is the only way to reduce the comparison time to O(lg n). On the other hand, we must change the data structure, because we can't achieve constant-time insertion at an arbitrary position within a plain array.

This reminds us of our 'hello world' data structure, the binary search tree. It naturally supports binary search by its definition, and at the same time we can insert a new leaf into a binary search tree in O(1) constant time if we have already found the location.
So the algorithm changes to this:

function Sort(A)
    T ← ∅
    for each x ∈ A do
        T ← Insert-Tree(T, x)
    return To-List(T)

where Insert-Tree() and To-List() are described in the previous chapter about binary search trees. As we analyzed for binary search trees, the performance of tree sort is bound to O(n lg n), which is the lower limit of comparison-based sorting [3].
2.6 Short summary

In this chapter, we presented the evolution of insertion sort. Insertion sort is well explained in most textbooks as the first sorting algorithm; it has a simple and straightforward idea, but its performance is quadratic. Some textbooks stop here, but we wanted to show that there are ways to improve it from different points of view. We first tried to save the comparison time by using binary search, and then tried to save the insertion operation by changing the data structure to a linked list. Finally, we combined these two ideas and evolved insertion sort into tree sort.
Bibliography

[1] Bubble sort. [Link]

[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001. ISBN: 0262032937

[3] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition). Addison-Wesley Professional; 2nd edition (May 4, 1998). ISBN-10: 0201896850; ISBN-13: 978-0201896855
Chapter 3
Red-black tree, not so complex as it was thought

3.1 Introduction

3.1.1 Exploit the binary search tree
We have shown the power of the binary search tree by using it to count the occurrences of every word in the Bible. The idea is to use the binary search tree as a dictionary for counting.

One may come up with the idea of feeding a whole yellow pages book to a binary search tree and using it to look up the phone number of a contact. Modifying the word occurrence counting program a bit yields the following code.
#include <iostream>
#include <fstream>
#include <map>
#include <string>
using namespace std;

int main(int, char**){
    ifstream f("[Link]");
    map<string, string> dict;
    string name, phone;
    while(f>>name && f>>phone)
        dict[name]=phone;
    for(;;){
        cout<<"\nname: ";
        cin>>name;
        if(dict.find(name)==dict.end())
            cout<<"not found";
        else
            cout<<"phone: "<<dict[name];
    }
}
This program works well. However, if you replace the STL map with the binary search tree as mentioned previously, the performance will be bad, especially when you search for names such as Zara, Zed, or Zulu.

This is because the contents of a yellow pages book are typically listed in lexicographic order, which means the name list is in increasing order. If we keep inserting such an ordered sequence into a binary search tree, it degenerates into a long skewed chain, as shown in figure 3.2.
Exercise 3.1

- For a very big yellow pages list, one may want to speed up the dictionary building process with two concurrent tasks (threads or processes): one task reads the name-phone pairs from the head of the list, while the other reads from the tail. The building terminates when the two tasks meet at the middle of the list. What will the binary search tree look like after such building? What if you split the list into more than two parts and use more tasks?
- Can you find any more cases that exploit a binary search tree? Please consider the unbalanced trees shown in figure 3.2.
3.1.2 How to ensure the balance of the tree

In order to avoid such cases, we can shuffle the input sequence with a randomized algorithm, as described in Section 12.4 of [2]. However, this method doesn't always work; for example, the input may be fed interactively by a user, and the tree needs to be built/updated after each input.

There are many solutions people have found to make a binary search tree balanced. Many of them rely on rotation operations on the binary search tree. Rotation operations change the tree structure while maintaining the ordering of the elements, so they either improve or keep the balance property of the binary search tree.
[Figure 3.2: some unbalanced trees: (a) and (b) are skewed chains of keys n, n−1, n−2, ...; (c) is a zig-zag layout of keys m, m−1, m+1, m−2, m+2, ...]
3.1.3 Tree rotation

Figure 3.3: Tree rotation. Rotate-left transforms the tree on the left side into the one on the right side, and rotate-right does the inverse transformation.

Tree rotation is a special operation that transforms the tree structure without changing the in-order traverse result. It is based on the fact that, for a specified ordering, there are multiple binary search trees corresponding to it. Figure 3.3 shows tree rotation: for the binary search tree on the left side, rotate-left transforms it into the tree on the right, and rotate-right does the inverse transformation.
Although tree rotation can be realized in a procedural way, it has a quite simple functional description using pattern matching:

    rotateL(T) =
        node(node(a, X, b), Y, c) : T = node(a, X, node(b, Y, c))
        T : otherwise

    rotateR(T) =
        node(a, X, node(b, Y, c)) : T = node(node(a, X, b), Y, c)
        T : otherwise
The procedural version has to maintain the parent pointers explicitly:

1: function Left-Rotate(T, x)
2:     p ← Parent(x)
3:     y ← Right(x)
4:     a ← Left(x)
5:     b ← Left(y)
6:     c ← Right(y)
7:     Replace(x, y)
8:     Set-Children(x, a, b)
9:     Set-Children(y, x, c)
10:    if p = NIL then
11:        T ← y
12:    return T

13: function Right-Rotate(T, y)
14:    p ← Parent(y)
15:    x ← Left(y)
16:    a ← Left(x)
17:    b ← Right(x)
18:    c ← Right(y)
19:    Replace(y, x)
20:    Set-Children(y, b, c)
21:    Set-Children(x, a, y)
22:    if p = NIL then
23:        T ← x
24:    return T

25: function Set-Left(x, y)    ▷ assume x ≠ NIL
26:    Left(x) ← y
27:    if y ≠ NIL then Parent(y) ← x

28: function Set-Right(x, y)
29:    Right(x) ← y
30:    if y ≠ NIL then Parent(y) ← x

31: function Set-Children(x, L, R)
32:    Set-Left(x, L)
33:    Set-Right(x, R)

34: function Replace(x, y)
35:    if Parent(x) = NIL then
36:        if y ≠ NIL then Parent(y) ← NIL
37:    else if Left(Parent(x)) = x then Set-Left(Parent(x), y)
38:    else Set-Right(Parent(x), y)
39:    Parent(x) ← NIL
Comparing these pseudo-codes with the pattern-matching functions, the former focus on the changing states of the structure, while the latter focus on the rotation process. As the title of this chapter indicates, the red-black tree needn't be as complex as it was thought to be. Most traditional algorithm textbooks use the classic procedural way to teach red-black trees; there are several cases to deal with, and all need carefulness in manipulating the node fields. However, by changing our mindset to functional settings, things become intuitive and simple, although there is some performance overhead.

Most of the content in this chapter is based on Chris Okasaki's work in [2].
3.2 Definition of red-black tree

A red-black tree is a binary search tree in which every node is colored either red or black, satisfying the following 5 properties:

1. Every node is either red or black.
2. The root is black.
3. Every leaf (NIL) is black.
4. If a node is red, then both its children are black.
5. For every node, all paths from it down to descendant leaves contain the same number of black nodes.

[Figure: an example red-black tree with keys 17, 11, 15, 25, 22, 27.]
All read-only operations, such as search and min/max, are the same as in the binary search tree; only insertion and deletion are special.

As we have shown in the word occurrence example, many implementations of set or map containers are based on red-black trees. One example is the C++ Standard Library (STL) [6].

As mentioned previously, the only change in the data layout is that color information is augmented to the binary search tree. This can be represented as a data field in imperative languages such as C++, like below.

enum Color {Red, Black};

template <class T>
struct node{
    Color color;
    T key;
    node* left;
    node* right;
    node* parent;
};
Exercise 3.2
Can you prove that a red-black tree with n nodes has height at most
2 lg(n + 1)?
3.3
Insertion
Inserting a new node, as described for binary search trees, may cause the tree to become unbalanced. The red-black properties have to be maintained, so we need to do some fixing by transforming the tree after insertion.
When we insert a new key, one good practice is to always insert it as a red node. As long as the newly inserted node isn't the root of the tree, we can keep all properties except the 4-th one: the insertion may bring two adjacent red nodes.
Functional and procedural implementations have different fixing methods. One is intuitive but has some overhead; the other is a bit complex but has higher performance. Most text books about algorithms introduce the latter. In this chapter, we focus on the former to show how easily a red-black tree insertion algorithm can be realized. The traditional procedural method will be given only for comparison purposes.
As described by Chris Okasaki, there are in total 4 cases which violate property 4. All of them have 2 adjacent red nodes. However, they share a uniform form after fixing[2], as shown in figure 3.5.
Note that this transformation moves the redness one level up. So this is a bottom-up recursive fixing; the last step may make the root node red. According to property 2, the root is always black. Thus we need a final fixing to revert the root color to black.
Observing that the 4 cases and the fixed result have strong pattern features, the fixing function can be defined by the method we used for tree rotation. To avoid overly long formulas, we abbreviate Color as C, Black as B, and Red as R.
balance(T) =
    node(R, node(B, A, x, B), y, node(B, C, z, D)) : match(T)
    T : otherwise
(3.3)

where function node() constructs a red-black tree node with 4 parameters: the color, the left child, the key and the right child. Function match() tests if a tree is one of the 4 possible patterns as the following:

match(T): T is one of
    node(B, node(R, node(R, A, x, B), y, C), z, D)
    node(B, node(R, A, x, node(R, B, y, C)), z, D)
    node(B, A, x, node(R, B, y, node(R, C, z, D)))
    node(B, A, x, node(R, node(R, B, y, C), z, D))
(3.4)

On top of balance, the insertion can be defined as

ins(T, k) =
    node(R, ∅, k, ∅) : T = ∅
    balance(node(C, ins(L, k), Key, R)) : k < Key
    balance(node(C, L, Key, ins(R, k))) : otherwise
(3.5)

L, R, Key represent the left child, right child and the key of a tree, and C its color:

L = left(T)
R = right(T)
Key = key(T)
Function makeBlack() is defined as follows; it forces the color of a non-empty tree to be black.
makeBlack(T ) = node(B, L, Key, R)
(3.6)
Summarizing the above functions and using the pattern matching features supported by the language, we come to the following Haskell program.
insert :: (Ord a) => RBTree a -> a -> RBTree a
insert t x = makeBlack $ ins t where
    ins Empty = Node R Empty x Empty
    ins (Node color l k r)
        | x < k     = balance color (ins l) k r
        | otherwise = balance color l k (ins r)
    makeBlack (Node _ l k r) = Node B l k r
Note that the balance function is changed a bit from the original definition. Instead of passing the tree, we pass the color, the left child, the key and the right child to it. This can save a pair of boxing and un-boxing operations.
This program doesn't handle the case of inserting a duplicated key. However, it is possible to handle it either by overwriting, or by skipping. Another option is to augment the data with a linked list[2].
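For later comparison with the imperative version, here is a minimal Python sketch of the same functional insertion; the tuple representation (color, left, key, right) with None as the empty tree, and all names, are illustrative assumptions:

R, B = 'R', 'B'

def is_red(t):
    return t is not None and t[0] == R

def balance(c, l, k, r):
    # fix the four "two adjacent red nodes" cases into the uniform form
    if c == B:
        if is_red(l) and is_red(l[1]):          # left-left
            (_, (_, a, x, b), y, c2) = l
            return (R, (B, a, x, b), y, (B, c2, k, r))
        if is_red(l) and is_red(l[3]):          # left-right
            (_, a, x, (_, b, y, c2)) = l
            return (R, (B, a, x, b), y, (B, c2, k, r))
        if is_red(r) and is_red(r[1]):          # right-left
            (_, (_, b, y, c2), z, d) = r
            return (R, (B, l, k, b), y, (B, c2, z, d))
        if is_red(r) and is_red(r[3]):          # right-right
            (_, b, y, (_, c2, z, d)) = r
            return (R, (B, l, k, b), y, (B, c2, z, d))
    return (c, l, k, r)

def insert(t, x):
    def ins(t):
        if t is None:
            return (R, None, x, None)
        c, l, k, r = t
        if x < k:
            return balance(c, ins(l), k, r)
        return balance(c, l, k, ins(r))
    (_, l, k, r) = ins(t)
    return (B, l, k, r)    # force the root to be black

For example, the first tree of figure 3.6 can be built with:

t = None
for k in [11, 2, 14, 1, 7, 15, 5, 8, 4]:
    t = insert(t, k)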
Figure 3.6 shows two red-black trees built by inserting the list 11, 2, 14, 1, 7, 15, 5, 8, 4 and the list 1, 2, ..., 8 respectively.
Exercise 3.3
3.4 Deletion
Recall the deletion section of binary search tree. Deletion is "imperative only" for red-black tree as well. In typical practice, the tree is often built just once, and looked up frequently after that. Okasaki explained why he didn't provide red-black tree deletion in his work [3]. One reason is that deletions are much messier than insertions.
The purpose of this section is just to show that red-black tree deletion is possible in purely functional settings, although it actually rebuilds the tree, because trees are read-only in terms of purely functional data structures. In the real world, it's up to the user (or actually the programmer) to adopt the proper solution. One option is to mark the node to be deleted with a flag, and perform a tree rebuilding when the number of deleted nodes exceeds 50% of the total number of nodes.
Not only in functional settings; even in imperative settings, deletion is more complex than insertion. We face more cases to fix. Deletion may also violate the red-black tree properties, so we need to fix them after the normal deletion as described in binary search tree.
The deletion algorithm in this book is based on a handout in [5]. The problem only happens if we try to delete a black node, because it violates the 4-th property of red-black tree: the number of black nodes in a path may decrease so that the black-height is not uniform any more.
When deleting a black node, we can resume red-black property number 4 by introducing a "doubly-black" concept[2]. It means that although the node is deleted, the blackness is kept by storing it in the parent node. If the parent node is red, it turns black; however, if it has already been black, it turns "doubly-black".
In order to express the doubly-black node, the definition needs some modification accordingly.
data Color = R | B | BB -- BB: doubly black for deletion
data RBTree a = Empty | BBEmpty -- doubly black empty
| Node Color (RBTree a) a (RBTree a)
When deleting a node, we first perform the same deleting algorithm of binary search tree mentioned in the previous chapter. After that, if the node to be sliced out is black, we need to fix the tree to keep the red-black properties. Let's denote the empty tree as ∅; a non-empty tree can be decomposed to node(C, L, Key, R), for its color, left sub-tree, key and right sub-tree. The delete function is defined as the following.
delete(T, k) = blackenRoot(del(T, k))
(3.7)

del(T, k) =
    ∅ : T = ∅
    fixBlack²(node(C, del(L, k), Key, R)) : k < Key
    fixBlack²(node(C, L, Key, del(R, k))) : k > Key
    { mkBlk(R) : C = B; R : otherwise } : L = ∅
    { mkBlk(L) : C = B; L : otherwise } : R = ∅
    fixBlack²(node(C, L, k′, del(R, k′))) : otherwise, where k′ = min(R)
(3.8)
The real deleting happens inside function del. For the trivial case, that the tree is empty, the deletion result is ∅. If the key to be deleted is less than the key of the current node, we recursively perform deletion on its left sub-tree; if it is bigger than the key of the current node, we recursively delete the key from the right sub-tree. Because this may bring doubly-blackness, we need to fix it.
If the key to be deleted is equal to the key of the current node, we need to splice it out. If one of its children is empty, we just replace the node by the other one and preserve the blackness of this node; otherwise we cut and paste the minimum element k′ = min(R) from the right sub-tree.
Function delete just forces the result tree of del to have a black root. This
is realized by function blackenRoot.
blackenRoot(T) =
    ∅ : T = ∅
    node(B, L, Key, R) : otherwise
(3.9)
Compared with the makeBlack function, which is defined in the red-black tree insertion section, they are almost the same, except for the case of the empty tree. This is only valid in deletion, because insertion can't result in an empty tree, while deletion may.
Function mkBlk is defined to preserve the blackness of a node. If the node to be sliced isn't black, this function won't be applied; otherwise, it turns a red node to black and turns a black node to doubly-black. This function also marks an empty tree as doubly-black empty.
mkBlk(T) =
    ∅² : T = ∅
    node(B, L, Key, R) : C = R
    node(B², L, Key, R) : C = B
    T : otherwise
(3.10)

where ∅² means the doubly-black empty node and B² is the doubly-black color.
Summarizing the above functions yields the following Haskell program.
delete :: (Ord a) => RBTree a -> a -> RBTree a
delete t x = blackenRoot (del t x) where
    del Empty _ = Empty
    del (Node color l k r) x
        | x < k = fixDB color (del l x) k r
        | x > k = fixDB color l k (del r x)
        -- x == k, delete this node
        | isEmpty l = if color == B then makeBlack r else r
        | isEmpty r = if color == B then makeBlack l else l
        | otherwise = fixDB color l k' (del r k') where k' = min r
    blackenRoot (Node _ l k r) = Node B l k r
    blackenRoot _ = Empty

makeBlack :: RBTree a -> RBTree a
makeBlack (Node B l k r) = Node BB l k r -- doubly black
makeBlack (Node _ l k r) = Node B l k r
makeBlack Empty = BBEmpty
makeBlack t = t
The final attack on the red-black tree deletion algorithm is to realize the fixBlack² function. The purpose of this function is to eliminate the doubly-black colored node by rotation and color changing.
Let's solve the doubly-black empty node first. For any node, if one of its children is doubly-black empty, and the other child is non-empty, we can safely replace the doubly-black empty with a normal empty node.
Like figure 3.7, if we are going to delete the node 4 from the tree (only part of the tree is shown instead of the whole tree), the program will use a doubly-black empty node to replace node 4. In the figure, the doubly-black node is shown as a black circle with 2 edges. It means that node 5 has a doubly-black empty left child and a non-empty right child (a leaf node with key 6). In such a case we can safely change the doubly-black empty to a normal empty node, which won't violate any red-black properties.
Figure 3.7: One child is doubly-black empty node, the other child is non-empty.
On the other hand, if a node has a doubly-black empty child and the other child is empty, we have to push the doubly-blackness up one level, as in figure 3.8.

Figure 3.8: One child is doubly-black empty node, the other child is empty. (b) After 1 is sliced off, it is doubly-black empty. (c) We must push the doubly-blackness up to node 2.
fixBlack²(T) =
    node(B², ∅, Key, ∅) : (L = ∅² ∧ R = ∅) ∨ (L = ∅ ∧ R = ∅²)
    node(C, ∅, Key, R) : L = ∅² ∧ R ≠ ∅
    node(C, L, Key, ∅) : R = ∅² ∧ L ≠ ∅
    ... : ...
(3.11)
After dealing with the doubly-black empty node, we need to fix the case that the sibling of the doubly-black node is black and it has one red child. In this situation, we can fix the doubly-blackness with one rotation. There are actually 4 different sub-cases; all of them can be transformed to one uniform pattern. They are shown in figure 3.9. These cases are described in [2] as case 3 and case 4.
Figure 3.9: Fix the doubly black by rotation, the sibling of the doubly-black
node is black, and it has one red child.
fixBlack²(T) =
    ... : ...
    node(C, node(B, mkBlk(A), x, B), y, node(B, C, z, D)) : p1.1
    node(C, node(B, A, x, B), y, node(B, C, z, mkBlk(D))) : p1.2
    ... : ...
(3.12)

where p1.1 and p1.2 each represent 2 patterns as the following.

p1.1 = ...
p1.2 = ...

If the sibling of the doubly-black node and its two children are all black, we can change the sibling to red, restore the doubly-black node toward black, and push the blackness up to the parent:

fixBlack²(T) =
    ... : ...
    mkBlk(node(C, mkBlk(A), x, node(R, B, y, C))) : p2.1
    mkBlk(node(C, node(R, A, x, B), y, mkBlk(C))) : p2.2
    ... : ...
(3.13)

If the sibling of the doubly-black node is red, we can rotate the tree to turn it into one of the previous cases, and fix recursively:

fixBlack²(T) =
    ... : ...
    fixBlack²(node(B, fixBlack²(node(R, A, x, B)), y, C)) : p3.1
    fixBlack²(node(B, A, x, fixBlack²(node(R, B, y, C)))) : p3.2
    T : otherwise
(3.14)
(a) Color of x can be either black or red. (b) If x was red, then it becomes black; otherwise, it becomes doubly-black.
(c) Color of y can be either black or red. (d) If y was red, then it becomes black; otherwise, it becomes doubly-black.
The deletion algorithm takes O(lg N ) time to delete a key from a red-black
tree with N nodes.
Exercise 3.4
3.5 Imperative red-black tree algorithm ★
We have almost finished all the content in this chapter. By induction on the patterns, we can implement the red-black tree in a simple way compared to the imperative tree rotation solution. However, we should show the imperative algorithm for completeness.
For insertion, the basic idea is to use the same algorithm as described for binary search tree, and then fix the balance problem by rotation and return the final result.
1: function Insert(T, k)
2:    root ← T
3:    x ← Create-Leaf(k)
4:    Color(x) ← RED
5:    parent ← NIL
6:    while T ≠ NIL do
7:        parent ← T
8:        if k < Key(T) then
9:            T ← Left(T)
10:       else
11:           T ← Right(T)
12:   Parent(x) ← parent
13:   if parent = NIL then    ▷ tree T is empty
14:       root ← x
15:   else if k < Key(parent) then
16:       Left(parent) ← x
17:   else
18:       Right(parent) ← x
19:   return Insert-Fix(root, x)
The only difference from the binary search tree insertion algorithm is that we set the color of the new node as red, and perform fixing before returning. It is easy to translate the pseudo code to a real imperative programming language, for instance Python³.
def rb_insert(t, key):
    root = t
    x = Node(key)
    parent = None
    while(t):
        parent = t
        if(key < t.key):
            t = t.left
        else:
            t = t.right
    if parent is None: # tree is empty
        root = x
    elif key < parent.key:
        parent.set_left(x)
    else:
        parent.set_right(x)
    return rb_insert_fix(root, x)
³ C and C++ source codes are available along with this book.
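The rb_insert_fix procedure referenced above can be sketched as below. This is a sketch of the classic bottom-up fixing approach from traditional textbooks [1], not the book's own listing; it assumes Node carries a color field, the constants RED and BLACK, and left_rotate/right_rotate helpers that return the (possibly changed) root:

RED, BLACK = 'R', 'B'

def rb_insert_fix(t, x):
    # fix two adjacent red nodes bottom-up, from x towards the root
    while x.parent and x.parent.color == RED:
        p = x.parent
        g = p.parent                                  # exists, since p is red
        u = g.right if p is g.left else g.left        # the uncle
        if u and u.color == RED:                      # case 1: recolor, go up
            p.color = BLACK
            u.color = BLACK
            g.color = RED
            x = g
        elif p is g.left:
            if x is p.right:                          # left-right: first rotation
                t = left_rotate(t, p)
                x, p = p, x
            p.color = BLACK                           # left-left: single rotation
            g.color = RED
            t = right_rotate(t, g)
        else:
            if x is p.left:                           # right-left: first rotation
                t = right_rotate(t, p)
                x, p = p, x
            p.color = BLACK                           # right-right: single rotation
            g.color = RED
            t = left_rotate(t, g)
    t.color = BLACK                                   # the root is always black
    return t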
Figure 3.12 shows the results of feeding the same series of keys to the above Python insertion program. Comparing them with figure 3.6, one can tell the difference clearly.
Exercise 3.5
Implement the red-black tree deletion algorithm in your favorite imperative programming language. You can refer to [2] for algorithm details.
3.6 More words
Bibliography
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford
Stein. Introduction to Algorithms, Second Edition. ISBN:0262032937.
The MIT Press. 2001
[2] Chris Okasaki. FUNCTIONAL PEARLS Red-Black Trees in a Functional
Setting. J. Functional Programming. 1998
[3] Chris Okasaki. Ten Years of Purely Functional Data Structures.
[Link]
[4] Wikipedia. Red-black tree. [Link] tree
[5] Lyn Turbak. Red-Black Trees. [Link]/ cs231/fall01/[Link] Nov. 2, 2001.
[6] SGI STL. [Link]
[7] Pattern matching. [Link] matching
Chapter 4

AVL tree
4.1 Introduction

4.1.1 How to measure the balance of a tree?
Besides red-black tree, are there any other intuitive solutions for self-balancing binary search trees? In order to measure how balanced a binary search tree is, one idea is to compare the heights of the left and right sub-trees. If they differ a lot, the tree isn't well balanced. Let's denote the height difference between the two children as

δ(T) = |R| − |L|
(4.1)

where |T| means the height of tree T, and L, R denote the left and right sub-trees.
If δ(T) = 0, the tree is definitely balanced. For example, a complete binary tree has N = 2^h − 1 nodes for height h; there are no empty branches except at the leaves. Another trivial case is the empty tree: δ(∅) = 0. The smaller the absolute value of δ(T), the more balanced the tree is.
We define δ(T) as the balance factor of a binary search tree.
4.2 Definition of AVL tree

An AVL tree is a special binary search tree in which all sub-trees satisfy the following criteria:

|δ(T)| ≤ 1
(4.2)

The absolute value of the balance factor is less than or equal to 1, which means there are only three valid values: −1, 0 and 1. Figure 4.1 shows an example AVL tree.
Why can an AVL tree keep itself balanced? In other words, can this definition ensure the height of the tree is O(lg N), where N is the number of nodes in the tree? Let's prove this fact.
For an AVL tree of height h, the number of nodes varies. It can have at most 2^h − 1 nodes (a complete binary tree). We are interested in the minimum number of nodes; denote it N(h).
We immediately know that there must be one child of height h − 1. Let's say height(A) = h − 1. According to the definition of AVL tree, we have |height(A) − height(B)| ≤ 1. This leads to the fact that the height of the other sub-tree B can't be lower than h − 2. So the total number of nodes of T is the number of nodes in tree A and in B, plus 1 (for the root node). We conclude that

N(h) = N(h − 1) + N(h − 2) + 1
(4.4)

This recurrence reminds us of the Fibonacci series. Define N′(h) = N(h) + 1; then

N′(h) = N′(h − 1) + N′(h − 2)
(4.5)

Lemma 4.2.1. Let N(h) be the minimum number of nodes for an AVL tree with height h, and N′(h) = N(h) + 1, then

N′(h) ≥ φ^h
(4.6)

where φ = (√5 + 1)/2 is the golden ratio.
Figure 4.2: An AVL tree with height h; one of the sub-trees has height h − 1, the other h − 2.
Proof. For h = 0, N′(0) = 1 ≥ φ⁰ = 1; for h = 1, N′(1) = 2 ≥ φ¹ = 1.618...
For the induction case, suppose N′(h) ≥ φ^h:

N′(h + 1) = N′(h) + N′(h − 1)    {Fibonacci}
          ≥ φ^h + φ^(h−1)
          = φ^(h−1)(φ + 1)        {φ + 1 = φ² = (√5 + 3)/2}
          = φ^(h+1)
(4.7)
This tells us that the height of an AVL tree is proportional to O(lg N), which means the AVL tree is balanced.
During the basic mutable tree operations such as insertion and deletion, if the balance factor changes to any invalid value, some fixing has to be performed to resume |δ| ≤ 1. Most implementations utilize tree rotations. In this chapter, we'll show the pattern matching solution which is inspired by Okasaki's red-black tree solution[2]. Because of this modify-and-fix approach, AVL tree is also a kind of self-balancing binary search tree. For comparison purposes, we'll also show the procedural algorithms.
Of course we can compute the δ value recursively; another option is to store the balance factor inside each node, and update it when we modify the tree. The latter avoids computing the same value every time.
Based on this idea, we can add one data field δ to the original binary search tree, as in the following C++ code example.
template <class T>
struct node{
    int delta;
    T key;
    node* left;
    node* right;
    node* parent;
};
The immutable operations, including looking up and finding the maximum and minimum elements, are all the same as in binary search tree. We'll skip them and focus on the mutable operations.
4.3 Insertion
Inserting a new element into an AVL tree may violate the AVL property, so that the absolute value of δ exceeds 1. To resume it, one option is to do tree rotation according to the different insertion cases. Most implementations are based on this approach.
Another way is to use the pattern matching method similar to what Okasaki used in his red-black tree implementation [2]. Inspired by this idea, it is possible to provide a simple and intuitive solution.
When we insert a new key to the AVL tree, the balance factor of the root may change in the range [−1, 1], and the height may increase by at most one. We need to recursively use this information to update the values in the upper level nodes. We can define the result of the insertion algorithm as a pair (T′, ΔH), where T′ is the new tree and ΔH is the increment of height. Let function first(pair) return the first element in a pair. We can modify the binary search tree insertion algorithm as the following to handle AVL tree.

insert(T, k) = first(ins(T, k))
(4.8)
where

ins(T, k) =
    (node(∅, k, ∅, 0), 1) : T = ∅
    tree(ins(L, k), Key, (R, 0), δ) : k < Key
    tree((L, 0), Key, ins(R, k), δ) : otherwise
(4.9)

L, R, Key, δ represent the left child, right child, the key and the balance factor of a tree:

L = left(T)
R = right(T)
Key = key(T)
δ = δ(T)
When we insert a new key k to an AVL tree T, if the tree is empty, we just need to create a leaf node with k, set the balance factor as 0, and the height is increased by one. This is the trivial case. Function node() is defined to build a tree by taking a left sub-tree, a right sub-tree, a key and a balance factor.
If T isn't empty, we need to compare Key with k. If k is less than the key, we recursively insert it to the left child; otherwise we insert it into the right child.
As we defined above, the result of the recursive insertion is a pair like (L′, ΔHl); we need to do the balancing adjustment as well as update the increment of height. Function tree() is defined to deal with this task. It takes 4 parameters: (L′, ΔHl), Key, (R′, ΔHr), and δ. The result of this function is (T′, ΔH), where T′ is the new tree after adjustment, and ΔH is the new increment of height, defined as

ΔH = |T′| − |T|
(4.10)

This can be further deduced into 4 cases:

ΔH = |T′| − |T|
   = 1 + max(|R′|, |L′|) − (1 + max(|R|, |L|))
   = max(|R′|, |L′|) − max(|R|, |L|)
   = ΔHr : δ ≥ 0 ∧ δ′ ≥ 0
     δ + ΔHr : δ ≤ 0 ∧ δ′ ≥ 0
     ΔHl − δ : δ ≥ 0 ∧ δ′ ≤ 0
     ΔHl : otherwise
(4.11)
To prove this equation, note the fact that the height can't increase in both the left and the right sub-tree with only one insertion.
These 4 cases can be explained from the definition of the balance factor: it equals the height of the right sub-tree minus the height of the left sub-tree.

If δ ≥ 0 and δ′ ≥ 0, it means that the height of the right sub-tree isn't less than the height of the left sub-tree, both before and after insertion. In this case, the increment in height of the tree is only contributed from the right sub-tree, which is ΔHr.

If δ ≤ 0, which means the height of the left sub-tree isn't less than the height of the right sub-tree before insertion, and it becomes δ′ ≥ 0, the height of the right sub-tree increases due to insertion while the left side keeps the same (|L′| = |L|). So the increment in height is

ΔH = max(|R′|, |L′|) − max(|R|, |L|)    {δ ≤ 0 ∧ δ′ ≥ 0}
   = |R′| − |L|                          {|L| = |L′|}
   = |R| + ΔHr − |L|
   = δ + ΔHr

The other two cases are symmetric.
The next problem in front of us is how to determine the new balance factor δ′ before performing the balancing adjustment. According to the definition of AVL tree, the balance factor is the height of the right sub-tree minus the height of the left sub-tree. We have the following facts:

δ′ = |R′| − |L′|
   = |R| + ΔHr − (|L| + ΔHl)
   = |R| − |L| + ΔHr − ΔHl
   = δ + ΔHr − ΔHl
(4.12)
With all these changes in height and balance factor made clear, it's possible to define the tree() function mentioned in (4.9):

tree((L′, ΔHl), Key, (R′, ΔHr), δ) = balance(node(L′, Key, R′, δ′), ΔH)
(4.13)
Before moving into the details of the balancing adjustment, let's translate the above equations to real programs in Haskell.
First is the insert function.
insert :: (Ord a) => AVLTree a -> a -> AVLTree a
insert t x = fst $ ins t where
    ins Empty = (Br Empty x Empty 0, 1)
    ins (Br l k r d)
        | x < k     = tree (ins l) k (r, 0) d
        | x == k    = (Br l k r d, 0)
        | otherwise = tree (l, 0) k (ins r) d
Here we also handle the case of inserting a duplicated key (which means the key already exists) by just overwriting.
tree :: (AVLTree a, Int) -> a -> (AVLTree a, Int) -> Int -> (AVLTree a, Int)
tree (l, dl) k (r, dr) d = balance (Br l k r d', delta) where
    d' = d + dr - dl
    delta = deltaH d d' dl dr
4.3.1 Balancing adjustment
Figure 4.3 shows the 4 cases which need fixing. For all these 4 cases the balance factors are either −2 or +2, which exceeds the range [−1, 1]. After the balancing adjustment, this factor turns to 0, which means the height of the left sub-tree is equal to the right sub-tree.
Figure 4.3: The 4 cases which need fixing (left-left, right-right, right-left and left-right lean); the unbalanced node has δ = ±2, and after fixing the new root y has δ′(y) = 0.
Let's look into the left-left lean case. The insertion doesn't touch sub-trees A, B, C and D, so their heights don't change. Before fixing, δ(y) = −1 and δ(z) = −2, which means

|C| = |x| − 1,  |D| = |y| − 2
(4.14)

After fixing:

δ′(z) = |D| − |C|    {From (4.14)}
      = |y| − 2 − (|x| − 1)
      = |y| − |x| − 1    {x is a child of y ⇒ |y| − |x| = 1}
      = 0
(4.15)

Since x's sub-trees aren't changed, δ′(x) = δ(x). For the new root y, its two children x and z now have the same height, so

δ′(y) = 0
(4.16)

Summarizing the above results, the left-left lean case adjusts the balance factors as the following:

δ′(x) = δ(x)
(4.17)
δ′(y) = 0
(4.18)
δ′(z) = 0
(4.19)
The right-right lean case is symmetric to left-left. For the right-left and left-right lean cases, fixing turns both into the same uniform pattern: the new root is y, with x over sub-trees (A, B) as the left child and z over (C, D) as the right child. The new balance factor of x is

δ′(x) = |B| − |A|
(4.20)

Before fixing, the heights satisfy

|A| = |D| = max(|B|, |C|)
(4.21)

If δ(y) = 1, it means |C| − |B| = 1, so max(|B|, |C|) = |C| = |B| + 1; taking this into (4.21) yields

δ′(x) = |B| − |A| = |B| − (|B| + 1) = −1
(4.22)

If δ(y) ≠ 1, it means max(|B|, |C|) = |B|; taking this into (4.21) yields

δ′(x) = |B| − |A| = 0
(4.24)

Summarizing these 2 cases, we get the relationship between δ′(x) and δ(y) as the following:

δ′(x) =
    −1 : δ(y) = 1
    0 : otherwise
(4.25)

Symmetrically, the new balance factor of z is δ′(z) = |D| − |C|, and we get

δ′(z) =
    1 : δ(y) = −1
    0 : otherwise
(4.27)

Finally, the two children of the new root y have the same height after fixing, which can be verified case by case from (4.25) and (4.27), so

δ′(y) = 0
(4.28)

The right-left and left-right lean cases adjust the balance factors as:

δ′(x) = { −1 : δ(y) = 1; 0 : otherwise }
δ′(y) = 0
δ′(z) = { 1 : δ(y) = −1; 0 : otherwise }
(4.29)
4.3.2 Pattern Matching
All the problems have been solved and it's time to define the final pattern matching fixing function:

balance(T, ΔH) =
    (node(node(A, x, B, δ(x)), y, node(C, z, D, 0), 0), 0) : Pll(T)
    (node(node(A, x, B, 0), y, node(C, z, D, δ(z)), 0), 0) : Prr(T)
    (node(node(A, x, B, δ′(x)), y, node(C, z, D, δ′(z)), 0), 0) : Prl(T) ∨ Plr(T)
    (T, ΔH) : otherwise
(4.30)

where Pll(T) means the pattern of tree T is left-left lean, and so on. δ′(x) and δ′(z) are defined in (4.29). The four patterns are tested as below:

Pll(T): T = node(node(node(A, x, B, δ(x)), y, C, −1), z, D, −2)
Prr(T): T = node(A, x, node(B, y, node(C, z, D, δ(z)), 1), 2)
Prl(T): T = node(node(A, x, node(B, y, C, δ(y)), 1), z, D, −2)
Plr(T): T = node(A, x, node(node(B, y, C, δ(y)), z, D, −1), 2)
(4.31)
Translating the above function definition to Haskell yields a simple and intuitive program:

balance (Br (Br (Br a x b dx) y c (-1)) z d (-2), _) =
        (Br (Br a x b dx) y (Br c z d 0) 0, 0)
balance (Br a x (Br b y (Br c z d dz) 1) 2, _) =
        (Br (Br a x b 0) y (Br c z d dz) 0, 0)
balance (Br (Br a x (Br b y c dy) 1) z d (-2), _) =
        (Br (Br a x b dx') y (Br c z d dz') 0, 0) where
    dx' = if dy ==  1 then -1 else 0
    dz' = if dy == -1 then  1 else 0
balance (Br a x (Br (Br b y c dy) z d (-1)) 2, _) =
        (Br (Br a x b dx') y (Br c z d dz') 0, 0) where
    dx' = if dy ==  1 then -1 else 0
    dz' = if dy == -1 then  1 else 0
balance (t, d) = (t, d)
The insertion algorithm takes time proportional to the height of the tree, and according to the result we proved above, its performance is O(lg N), where N is the number of elements stored in the AVL tree.
Verification
One can easily create a function to verify that a tree is an AVL tree. Actually we need to verify two things: first, that it's a binary search tree; second, that it satisfies the AVL tree property.
We leave the first verification problem as an exercise to the reader.
In order to test if a binary tree satisfies the AVL tree property, we can test the difference in height between its two children, and recursively test that both children conform to the AVL property until we arrive at an empty leaf.
avl?(T) =
    True : T = ∅
    avl?(L) ∧ avl?(R) ∧ ||R| − |L|| ≤ 1 : otherwise
(4.32)
And the height of an AVL tree can also be calculated from the definition:

|T| =
    0 : T = ∅
    1 + max(|R|, |L|) : otherwise
(4.33)
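As a sketch, both functions can be written in Python over a tuple representation (left, key, right), with None for the empty tree (the representation is an illustrative assumption):

def height(t):
    if t is None:
        return 0
    (l, _, r) = t
    return 1 + max(height(l), height(r))

def is_avl(t):
    # recursively check the AVL property of equation (4.32)
    if t is None:
        return True
    (l, _, r) = t
    return is_avl(l) and is_avl(r) and abs(height(r) - height(l)) <= 1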
Exercise 4.1
Write a program to verify a binary tree is a binary search tree in your
favorite programming language. If you choose to use an imperative language,
please consider realize this program without recursion.
4.4 Deletion
As we mentioned before, deletion doesn't make significant sense in purely functional settings. As the tree is read-only, it typically performs frequent looking up after being built.
Even if we implement deletion, it actually rebuilds the tree, as we presented in the chapter about red-black tree. We leave the deletion of AVL tree as an exercise to the reader.
Exercise 4.2
Taking the red-black tree deletion algorithm as an example, write the AVL tree deletion program in a purely functional approach in your favorite programming language.
Write the deletion algorithm in an imperative approach in your favorite programming language.
4.5 Imperative AVL tree algorithm ★
We have almost finished all the content about AVL tree in this chapter. However, it is necessary to show the traditional insert-and-rotate approach as a comparator to the pattern matching algorithm.
Similar to the imperative red-black tree algorithm, the strategy is to first do the insertion as for binary search tree, then fix the balance problem by rotation, and return the final result.
1: function Insert(T, k)
2:    root ← T
3:    x ← Create-Leaf(k)
4:    δ(x) ← 0
5:    parent ← NIL
6:    while T ≠ NIL do
7:        parent ← T
8:        if k < Key(T) then
9:            T ← Left(T)
10:       else
11:           T ← Right(T)
12:   Parent(x) ← parent
13:   if parent = NIL then    ▷ tree T is empty
14:       root ← x
15:   else if k < Key(parent) then
16:       Left(parent) ← x
17:   else
18:       Right(parent) ← x
19:   return AVL-Insert-Fix(root, x)
Note that after insertion, the height of the tree may increase, so the balance factor δ may also change: inserting on the right side will increase δ by 1, while inserting on the left side will decrease it. By the end of this algorithm, we need to perform bottom-up fixing from node x towards the root.
This is a top-down algorithm: it searches the tree from the root down to the proper position and inserts the new key as a leaf. At the end, it calls the fixing procedure, passing the root and the newly inserted node.
Note that we reuse the same methods set_left() and set_right() as we defined in the chapter about red-black tree.
In order to resume the AVL tree balance property by fixing, we first determine if the new node is inserted on the left hand or the right hand. If it is on the left, the balance factor δ decreases; otherwise it increases. If we denote the new value as δ′, there are 3 cases in the relationship between δ and δ′.

If |δ| = 1 and |δ′| = 0, it means adding the new node makes the tree perfectly balanced; the height of the parent node doesn't change, and the algorithm can terminate.

If |δ| = 0 and |δ′| = 1, it means that either the height of the left sub-tree or the right sub-tree increases; we need to go on checking the upper level of the tree.

If |δ| = 1 and |δ′| = 2, it means the AVL tree property is violated due to the new insertion. We need to perform rotation to fix it.
1: function AVL-Insert-Fix(T, x)
2:    while Parent(x) ≠ NIL do
3:        δ ← δ(Parent(x))
4:        if x = Left(Parent(x)) then
5:            δ′ ← δ − 1
6:        else
7:            δ′ ← δ + 1
8:        δ(Parent(x)) ← δ′
9:        P ← Parent(x)
10:       L ← Left(x)
11:       R ← Right(x)

² C and C++ source code are available along with this book.
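The rest of the fixing procedure follows the three cases above. Here is a minimal Python sketch of the complete procedure (an illustrative sketch, assuming Node carries a delta field and the left_rotate/right_rotate helpers sketched in the red-black tree chapter; the balance factor updates follow (4.25)–(4.28)):

def avl_insert_fix(t, x):
    while x.parent is not None:
        d = x.parent.delta
        d1 = d - 1 if x is x.parent.left else d + 1
        x.parent.delta = d1
        p = x.parent
        if abs(d) == 1 and abs(d1) == 0:
            return t                      # case 1: perfectly balanced, done
        elif abs(d) == 0 and abs(d1) == 1:
            x = p                         # case 2: height increased, go upward
        else:                             # case 3: abs(d1) == 2, rotate
            if d1 == 2:
                if x.delta == 1:          # right-right: single left rotation
                    p.delta = 0
                    x.delta = 0
                    t = left_rotate(t, p)
                else:                     # right-left: double rotation
                    y = x.left
                    dy = y.delta
                    t = right_rotate(t, x)
                    t = left_rotate(t, p)
                    p.delta = -1 if dy == 1 else 0
                    x.delta = 1 if dy == -1 else 0
                    y.delta = 0
            else:                         # d1 == -2
                if x.delta == -1:         # left-left: single right rotation
                    p.delta = 0
                    x.delta = 0
                    t = right_rotate(t, p)
                else:                     # left-right: double rotation
                    y = x.right
                    dy = y.delta
                    t = left_rotate(t, x)
                    t = right_rotate(t, p)
                    p.delta = 1 if dy == -1 else 0
                    x.delta = -1 if dy == 1 else 0
                    y.delta = 0
            break
    return t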
We skip the AVL tree deletion algorithm and leave it as an exercise to the reader.
4.6 Chapter note
AVL tree was invented in 1962 by Adelson-Velskii and Landis[3], [4]. The name AVL comes from the two inventors' names. It's earlier than red-black tree.
It's very common to compare AVL tree and red-black tree: both are self-balancing binary search trees, and for all the major operations, they both take O(lg N) time. From the result of (4.7), AVL tree is more rigidly balanced, hence it is faster than red-black tree in lookup-intensive applications [3]. However, red-black trees could perform better in cases with frequent insertion and removal.
Many popular self-balancing binary search tree libraries are implemented on top of red-black tree, such as STL, etc. However, AVL tree provides an intuitive and effective solution to the balance problem as well.
After this chapter, we'll extend the tree data structure from storing data in nodes to storing information on edges, which leads to Trie and Patricia, etc. If we extend the number of children from two to more, we get B-tree. These data structures will be introduced next.
Bibliography
[1] [Link] [Link]
[2] Chris Okasaki. FUNCTIONAL PEARLS Red-Black Trees in a Functional Setting. J. Functional Programming. 1998
[3] Wikipedia. AVL tree. [Link] tree
[4] Guy Cousinear, Michel Mauny. The Functional Approach to Programming. Cambridge University Press; English Ed edition (October 29, 1998). ISBN-13: 978-0521576819
[5] Pavel Grafov. Implementation of an AVL tree in Python. [Link]
Chapter 5

Trie and Patricia

5.1 Introduction
The binary trees introduced so far store information in nodes. It's possible to store the information in edges instead. Trie and Patricia are important data structures in information retrieval and manipulation. They were invented in the 1960s, and are widely used in compiler design[2] and in bio-informatics, such as DNA pattern matching [3].
Figure 5.1: A binary radix tree with keys 0, 1, 10, 011, 100 and 1011.
For an n-bit key, we take the first bit (from the left) and check if it is 0 or 1: if it is 0, we turn left, and we turn right for 1. Then we take the second bit and repeat this search till we either meet a leaf or finish all n bits.
The radix tree needn't store keys in nodes at all. The information is represented by edges. The nodes marked with keys in the above figure are for illustration purposes only.
It is very natural to come to the idea: is it possible to represent the key as an integer instead of a string? Because an integer can be written in binary format, such an approach can save space. Another advantage is that the speed is fast, because we can use bit-wise manipulation in most programming environments.
5.2 Integer Trie
The data structure shown in figure 5.1 is called a binary trie. Trie was invented by Edward Fredkin. The name comes from "retrieval", pronounced /tri:/ by the inventor, while it is pronounced /trai/ "try" by other authors [5]. Trie is also called prefix tree or radix tree. A binary trie is a special binary tree in which the placement of each key is controlled by its bits: each 0 means "go left" and each 1 means "go right"[2].
Because integers can be represented in binary format, it is possible to store integer keys rather than 0/1 strings. When inserting an integer as the new key to the trie, we change it to binary form, then examine the first bit: if it is 0, we recursively insert the rest of the bits to the left sub-tree; otherwise, if it is 1, we insert into the right sub-tree.
There is a problem when treating the key as an integer. Consider the binary trie shown in figure 5.2. If represented as 0/1 strings, all three keys are different. But they are identical when turned into integers. Where should we insert decimal 3, for example, into the trie?
One approach is to treat all the prefix zeros as effective bits. Suppose the integer is represented with 32 bits; if we want to insert key 1, it ends up with a tree of 32 levels. There are 31 nodes, each only having the left sub-tree; the last node only has the right sub-tree. It is very inefficient in terms of space.
Okasaki shows a method to solve this problem in [2]. Instead of using big-endian integers, we can use little-endian integers to represent keys. Thus decimal integer 1 is represented as binary 1. Inserting it to the empty binary trie, the result is a trie with a root and a right leaf; there is only 1 level. Decimal 2 is represented as 01, and decimal 3 is (11)₂ in little-endian binary format. There is no need to add any prefix 0; the position in the trie is uniquely determined.
5.2.1 Definition
In order to define the little-endian binary trie, we can reuse the structure of binary tree. A binary trie node is either empty, or a branch node. The branch node contains a left child, a right child, and an optional value as satellite data. The left sub-tree is encoded as 0 and the right sub-tree is encoded as 1.
The following example Haskell code defines the trie algebraic data type.
data IntTrie a = Empty
| Branch (IntTrie a) (Maybe a) (IntTrie a)
Figure 5.2: A binary trie with keys "11", "011" and "0011". As 0/1 strings they are all different, but as integers they are all 3.
5.2.2 Insertion
Since the key is a little-endian integer, when inserting a key, we take the bits one by one from the right most (LSB). If the bit is 0, we go to the left; otherwise we go to the right for 1. If the child is empty, we need to create a new node. We repeat this to the last bit (MSB) of the key.
1: function Insert(T, k, v)
2:    if T = NIL then
3:        T ← Empty-Node
4:    p ← T
5:    while k ≠ 0 do
6:        if Even?(k) then
7:            if Left(p) = NIL then
8:                Left(p) ← Empty-Node
9:            p ← Left(p)
10:       else
11:           if Right(p) = NIL then
12:               Right(p) ← Empty-Node
13:           p ← Right(p)
14:       k ← ⌊k/2⌋
15:   Data(p) ← v
16:   return T

This algorithm takes 3 arguments: a trie T, a key k, and the satellite data v. The satellite data is optional; it is empty by default. The following Python example code implements the insertion algorithm.
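This is a minimal sketch; the IntTrie class and names below are illustrative assumptions:

class IntTrie:
    def __init__(self):
        self.left = self.right = None
        self.value = None

def trie_insert(t, key, value = None):
    if t is None:
        t = IntTrie()
    p = t
    while key != 0:
        if key & 1 == 0:        # even: the LSB is 0, go left
            if p.left is None:
                p.left = IntTrie()
            p = p.left
        else:                   # odd: the LSB is 1, go right
            if p.right is None:
                p.right = IntTrie()
            p = p.right
        key = key >> 1          # drop the LSB
    p.value = value
    return t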
Figure 5.3 shows a trie which is created by inserting the pairs of key and value {1 → a, 4 → b, 5 → c, 9 → d} into the empty trie.
The insertion can also be defined recursively. If the LSB is 0, the key is an even number; we recursively insert into the left child and halve the key to get rid of the LSB. If the LSB is 1, the key is an odd number, and the recursive insertion is applied to the right child. For trie T, denote the left and right children as Tl and Tr respectively. Thus T = (Tl, d, Tr), where d is the optional satellite data. If T is empty, Tl, Tr and d are defined as empty as well.
insert(T, k, v) =
    (Tl, v, Tr) : k = 0
    (insert(Tl, ⌊k/2⌋, v), d, Tr) : even(k)
    (Tl, d, insert(Tr, ⌊k/2⌋, v)) : otherwise
(5.1)
If the key to be inserted already exists, this algorithm just overwrites the previously stored data. This can be replaced with other alternatives, such as storing the data in a linked list, etc.
The following Haskell example program implements the insertion algorithm.

insert t 0 x = Branch (left t) (Just x) (right t)
insert t k x | even k    = Branch (insert (left t) (k `div` 2) x) (value t) (right t)
             | otherwise = Branch (left t) (value t) (insert (right t) (k `div` 2) x)

left (Branch l _ _) = l
left Empty = Empty

right (Branch _ _ r) = r
right Empty = Empty

value (Branch _ v _) = v
value Empty = Nothing
For a given integer k with m bits in binary, the insertion algorithm goes through m levels. The performance is bound to O(m) time.
5.2.3 Look up
To look up key k in the little-endian integer binary trie, we take each bit of k from the right (LSB), then go left if this bit is 0; otherwise, we go right. The looking up completes when all bits are consumed.
1: function Lookup(T, k)
2:    while k ≠ 0 ∧ T ≠ NIL do
3:        if Even?(k) then
4:            T ← Left(T)
5:        else
6:            T ← Right(T)
7:        k ← ⌊k/2⌋
8:    if T ≠ NIL then
9:        return Data(T)
10:   else
11:       return not found
The Python example code below uses bit-wise operations to implement the looking up algorithm.

def lookup(t, key):
    while key != 0 and (t is not None):
        if key & 1 == 0:
            t = t.left
        else:
            t = t.right
        key = key >> 1
    if t is not None:
        return t.value
    return None
Looking up can also be defined in a recursive manner. If the tree is empty, the looking up fails; if k = 0, the satellite data is the result to be found; if the last bit is 0, we recursively look up the left child; otherwise, we look up the right child.
lookup(T, k) =
    ∅ : T = ∅
    d : k = 0
    lookup(Tl, ⌊k/2⌋) : even(k)
    lookup(Tr, ⌊k/2⌋) : otherwise
(5.2)
5.3 Integer Patricia
Trie has some drawbacks. It wastes a lot of space: note that in figure 5.3, only the leaves store the real data. Typically, the integer binary trie contains many nodes that have only one child. One improvement idea is to compress the chained nodes together. Patricia is such a data structure, invented by Donald R. Morrison in 1968. Patricia stands for "practical algorithm to retrieve information coded in alphanumeric"[3]. It is another kind of prefix tree.
Okasaki gives an implementation of integer Patricia in [2]. If we merge together the chained nodes which have only one child in figure 5.3, we get the Patricia shown in figure 5.4.
From this figure, we can find that the key for sibling nodes is their longest common prefix. They branch out at a certain bit. Patricia saves a lot of space compared to trie.
Different from the integer trie, using big-endian integers in Patricia doesn't cause the padding-zero problem mentioned in section 5.2. All zero bits before the MSB are omitted to save space. Okasaki lists some significant advantages of big-endian Patricia[2].
5.3.1 Definition
Figure 5.4: Little-endian Patricia for the map {1 → a, 4 → b, 5 → c, 9 → d}.
In order to tell from which bit the left and right children differ, a mask is recorded in the branch node. Typically, the mask is a power of 2, like 2^n for some non-negative integer n; all bits lower than n don't belong to the common prefix.
The following Python example code defines Patricia as well as some auxiliary
functions.
class IntTree:
def __init__(self, key = None, value = None):
[Link] = key
[Link] = value
[Link] = [Link] = None
[Link] = [Link] = None
def set_children(self, l, r):
[Link] = l
[Link] = r
def replace_child(self, x, y):
if [Link] == x:
[Link] = y
5.3.2 Insertion
When inserting a key, if the tree is empty, we can just create a leaf node with the given key and satellite data, as shown in figure 5.5.

Figure 5.5: Left: the empty tree; Right: After inserting key 12.
If the tree is just a singleton leaf node x, we can create a new leaf y and put the key and data into it. After that, we need to create a new branch node and set x and y as the two children. In order to determine whether y should be the left or the right child, we need to find the longest common prefix of x and y. For example, if key(x) is 12 ((1100)₂ in binary) and key(y) is 15 ((1111)₂ in binary), then the longest common prefix is (11oo)₂, where o denotes the bits we don't care about. We can use another integer to mask those bits. In this case, the mask number is 4 ((100)₂ in binary). The next bit after the longest common prefix represents 2¹. This bit is 0 in key(x), while it is 1 in key(y). We should set x as the left child and y as the right child. Figure 5.6 shows this example.
In case the tree is neither empty nor a singleton leaf, we first need to check if the key to be inserted matches the longest common prefix recorded in the root, and then recursively insert the key into the left or the right child according to the next bit after the common prefix. For example, if we insert key 14 ((1110)₂ in binary) into the result tree of figure 5.6, since the common prefix is (11oo)₂ and the next bit (the bit of 2¹) is 1, we recursively insert into the right child.
If the key to be inserted doesn't match the longest common prefix stored in the root, we need to branch a new leaf out. Figure 5.7 shows these two different cases.
For a given key k and value v, denote (k, v) as the leaf node. For a branch node, denote it in the form (p, m, Tl, Tr), where p is the longest common prefix, m is the mask, and Tl and Tr are the left and right children. Summarizing the above cases, the insertion algorithm can be defined as below.
Figure 5.6: Left: A tree with a singleton leaf 12; Right: After inserting key 15.

Figure 5.7: The two cases of inserting into a branch node: the new key matches the longest common prefix (left), or a new leaf is branched out (right).
insert(T, k, v) =
    (k, v) : T = ∅ ∨ T = (k, v′)
    join(k, (k, v), k′, T) : T = (k′, v′)
    (p, m, insert(Tl, k, v), Tr) : T = (p, m, Tl, Tr), match(k, p, m), zero(k, m)
    (p, m, Tl, insert(Tr, k, v)) : T = (p, m, Tl, Tr), match(k, p, m), ¬zero(k, m)
    join(k, (k, v), p, T) : otherwise
(5.3)
The first clause deals with the edge cases: either T is empty, or it is a leaf node with the same key. The algorithm overwrites the previous value in the latter case.
The second clause handles the case that T is a leaf node, but with a different key. Here we need to branch out another new leaf. We need to extract the longest common prefix and determine which leaf should be set as the left child and which as the right. Function join(k1, T1, k2, T2) does this work. We'll define it later.
The third clause deals with the case that T is a branch node, the longest common prefix matches the key to be inserted, and the next bit after the common prefix is zero. Here we need to recursively insert into the left child.
The fourth clause handles the similar case, except that the next bit after the common prefix is one, not zero. We need to recursively insert into the right child.
The last clause is for the case that the key to be inserted doesn't match the longest common prefix stored in the branch. We need to branch out a new leaf by calling the join function.
We need to define function match(k, p, m) to test if the key k has the same prefix p above the masked bits m. For example, suppose the prefix stored in a branch node is (pn pn−1 ... pi ... p0)₂ in binary, key k is (kn kn−1 ... ki ... k0)₂ in binary, and the mask is (100...0)₂ = 2^i. They match if and only if pj = kj for all i ≤ j ≤ n.
One solution to realize match is to test if mask(k, m) = p is satisfied, where mask(x, m) = ¬(m − 1) & x: we perform bitwise-not of m − 1, then perform bitwise-and with x.
Function zero(k, m) tests if the next bit after the common prefix is zero. With the help of the mask m, we can shift m one bit to the right, then perform bitwise-and with the key:

zero(k, m) = (k & (m ≫ 1)) = 0
(5.4)
If the mask m = (100...0)₂ = 2^i and k = (kn kn−1 ... ki 1 ... k0)₂, because the bit next to ki is 1, zero(k, m) returns false; if k = (kn kn−1 ... ki 0 ... k0)₂, then the result is true.
Function join(p1, T1, p2, T2) takes two different prefixes and trees. It extracts the longest common prefix p and the mask m, (p, m) = lcp(p1, p2), creates a new branch node, and sets T1 and T2 as the two children:

join(p1, T1, p2, T2) =
    (p, m, T1, T2) : zero(p1, m), where (p, m) = lcp(p1, p2)
    (p, m, T2, T1) : otherwise
(5.6)
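The imperative insertion can be sketched in Python as below. This is a sketch only; it assumes the IntTree class of section 5.3.1, extended with the is_leaf() and get_prefix() helpers used by the auxiliary functions that follow:

# assumed helpers on IntTree (illustrative):
IntTree.is_leaf = lambda self: self.left is None and self.right is None
IntTree.get_prefix = lambda self: self.key if self.is_leaf() else self.prefix

def int_patricia_insert(t, key, value = None):
    if t is None:
        return IntTree(key, value)      # trivial case: a new leaf
    node, parent = t, None
    while True:
        if match(key, node):            # key agrees with the stored prefix
            parent = node
            node = node.left if zero(key, node.mask) else node.right
        else:
            if node.is_leaf() and key == node.key:
                node.value = value      # duplicated key: overwrite
            else:                       # branch out a new leaf
                new_node = branch(node, IntTree(key, value))
                if parent is None:
                    t = new_node
                else:
                    parent.replace_child(node, new_node)
            return t

For example, inserting 1 → x, 4 → y and 5 → z one by one into an empty tree yields the Patricia of figure 5.8.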
The auxiliary functions match, branch, lcp, etc. are given below.

def maskbit(x, mask):
    return x & (~(mask - 1))

def match(key, tree):
    return (not tree.is_leaf()) and maskbit(key, tree.mask) == tree.prefix

def zero(x, mask):
    return x & (mask >> 1) == 0

def lcp(p1, p2):
    diff = (p1 ^ p2)
    mask = 1
    while diff != 0:
        diff >>= 1
        mask <<= 1
    return (maskbit(p1, mask), mask)

def branch(t1, t2):
    t = IntTree()
    (t.prefix, t.mask) = lcp(t1.get_prefix(), t2.get_prefix())
    if zero(t1.get_prefix(), t.mask):
        t.set_children(t1, t2)
    else:
        t.set_children(t2, t1)
    return t
Figure 5.8 shows an example Patricia created with the insertion algorithm.

Figure 5.8: A Patricia created by inserting the map {1 → x, 4 → y, 5 → z}.
5.3.3 Look up
Consider the property of the integer Patricia tree. When looking up a key, if it has a common prefix with the root, we check the next bit. If this bit is zero, we recursively look up the left child; otherwise, if the bit is one, we look up the right child.
When we reach a leaf node, we can directly check if the key of the leaf is equal to what we are looking up. This algorithm can be described with the following pseudo code.
1: function Look-Up(T, k)
2:    if T = NIL then
3:        return NIL    ▷ Not found
4:    while T is not leaf, and Match(k, Prefix(T), Mask(T)) do
5:        if Zero?(k, Mask(T)) then
6:            T ← Left(T)
7:        else
8:            T ← Right(T)
9:    if T is leaf, and Key(T) = k then
10:       return Data(T)
11:   else
12:       return NIL    ▷ Not found
Below Python example program implements the looking up algorithm.
def lookup(t, key):
if t is None:
return None
while (not t.is_leaf()) and match(key, t):
if zero(key, [Link]):
t = [Link]
else:
t = [Link]
if t.is_leaf() and [Link] == key:
return [Link]
else:
return None
lookup(T, k) =
    ∅ : T = ∅ ∨ (T = (k′, v), k′ ≠ k)
    v : T = (k′, v), k′ = k
    lookup(Tl, k) : T = (p, m, Tl, Tr), match(k, p, m), zero(k, m)
    lookup(Tr, k) : T = (p, m, Tl, Tr), match(k, p, m), ¬zero(k, m)
    ∅ : otherwise
(5.7)
5.4 Alphabetic Trie
Integer-based Trie and Patricia trees can be a good starting point. Such techniques play an important role in compiler implementation. Okasaki pointed out that the widely used Glasgow Haskell Compiler (GHC) utilized a similar implementation for several years before 1998 [2].
If we extend the key from integer to alphabetic value, Trie and Patricia trees can be very powerful in solving textual manipulation problems.
5.4.1 Definition
It's not enough to just use the left and right children to represent alphabetic keys. Using English for example, there are 26 letters and each can be lower or upper case. If we don't care about the case, one solution is to limit the number of branches (children) to 26. Some simplified ANSI C implementations of trie are defined by using an array of 26 letters. This can be illustrated as in figure 5.9.
Not all branch nodes contain data. For instance, in figure 5.9, the root only has three non-empty branches, representing letters 'a', 'b', and 'z'. Other branches, such as the one for letter 'c', are all empty. We don't show empty branches in the rest of this chapter.
If we deal with case-sensitive problems, or handle languages other than English, there can be more letters than 26. The problem of the dynamic size of sub-branches can be solved by using collection data structures, such as a hash table or a map.
An alphabetic trie is either empty or a node. There are two types of node:

A leaf node doesn't have any sub-trees;

A branch node contains multiple sub-trees. Each sub-tree is bound to a character.

Both leaf and branch may contain optional satellite data. The following Haskell code shows the example definition.
data Trie a = Trie { value :: Maybe a
, children :: [(Char, Trie a)]}
empty = Trie Nothing []
Figure 5.9: A trie with 26 branches, containing keys "a", "an", "another", "bool", "boy" and "zoo".
The ANSI C code below defines the alphabetic trie. For illustration purposes only, we limit the character set to lower case English letters, from 'a' to 'z'.

struct Trie {
    struct Trie* children[26];
    void* data;
};
5.4.2 Insertion
When we insert a string as the key, we start from the root and pick the characters one by one from the string, examining which child represents each character. If the corresponding child is empty, a new empty node is created. After that, the next character is used to select the proper grandchild.
We repeat this process for all the characters, and finally store the optional satellite data in the node we arrived at.
The pseudo code below describes the insertion algorithm.
1: function Insert(T, k, v)
2:    if T = NIL then
3:        T ← Empty-Node
4:    p ← T
5:    for each c in k do
6:        if Children(p)[c] = NIL then
7:            Children(p)[c] ← Empty-Node
8:        p ← Children(p)[c]
9:    Data(p) ← v
10:   return T
The following example ANSI C program implements the insertion algorithm.
struct Trie insert(struct Trie t, const char key, void value){
int c;
struct Trie p;
if(!t)
t = create_node();
for (p = t; key; ++key, p = pchildren[c]) {
c = key - a;
if (!pchildren[c])
pchildren[c] = create_node();
}
pdata = value;
return t;
}
Where function create node creates new empty node, with all children initialized to empty.
struct Trie create_node(){
struct Trie t = (struct Trie) malloc(sizeof(struct Trie));
int i;
for (i=0; i<26; ++i)
tchildren[i] = NULL;
tdata = NULL;
return t;
118
The insertion can also be realized in a recursive way. Denote the key to be inserted as K = k1 k2 ... kn, where ki is the i-th character. K′ is the rest of the characters except k1; v′ is the satellite data to be inserted. The trie is in the form T = (v, C), where v is the satellite data and C = {(c1, T1), (c2, T2), ..., (cm, Tm)} is the map of children. It maps from character ci to sub-tree Ti. If T is empty, then C is also empty.

insert(T, K, v′) =
    (v′, C) : K = ∅
    (v, ins(C, k1, K′, v′)) : otherwise
(5.8)
If the key is empty, the previous value v is overwritten with v′. Otherwise, we need to check the children and perform recursive insertion. This is realized in function ins(C, k1, K′, v′). It examines the key-sub tree pairs in C one by one. Let C′ be the rest of the pairs except for the first one. This function can be defined as below:

ins(C, k, K′, v′) =
    {(k, insert(∅, K′, v′))} : C = ∅
    {(k, insert(T1, K′, v′))} ∪ C′ : c1 = k
    {(c1, T1)} ∪ ins(C′, k, K′, v′) : otherwise
(5.9)
5.4.3 Look up
To look up a key, we also extract the characters from the key one by one. For each character, we search among the children to see if there is a branch matching this character. If there is no such a child, the look up process terminates immediately to indicate the "not found" error. When we reach the last character of the key, the data stored in the current node is what we are looking up.
1: function Look-Up(T, key)
2:    if T = NIL then
3:        return not found
4:    for each c in key do
5:        if Children(T)[c] = NIL then
6:            return not found
7:        T ← Children(T)[c]
8:    return Data(T)
The ANSI C program below implements the look up algorithm. It returns NULL to indicate the "not found" error.

void* lookup(struct Trie* t, const char* key) {
    while (*key) {
        int c = *key - 'a';
        if (!t || !t->children[c])
            return NULL;
        t = t->children[c];
        ++key;
    }
    return t ? t->data : NULL;
}
The look up algorithm can also be realized in a recursive manner. When looking up a key, we start from the first character. If it is bound to some child, we then recursively search the rest of the characters in that child. Denote the trie as T = (v, C), and the key being searched as K = k1 k2 ... kn if it isn't empty. The first character in the key is k1, and the rest of the characters are denoted as K′.
lookup(T, K) =
    v : K = ∅
    ∅ : find(C, k1) = ∅
    lookup(T′, K′) : find(C, k1) = T′
(5.10)

find(C, k) =
    ∅ : C = ∅
    T1 : k1 = k
    find(C′, k) : otherwise
(5.11)
The following Haskell example program implements the trie looking up algorithm. It uses the find function provided in standard library[5].
find t [] = value t
find t (k:ks) = case lookup k (children t) of
Nothing Nothing
Just t find t ks
Exercise 5.1
Develop an imperative trie by using collection data structures to manage the multiple sub-trees in an alphabetic trie.
5.5 Alphabetic Patricia
Similar to integer trie, alphabetic trie is not memory efficient. We can use the
same method to compress alphabetic trie to Patricia.
5.5.1 Definition
Alphabetic Patricia tree is a special prefix tree in which each node contains multiple branches. All children of a node share the longest common prefix string. As a result, there is no node with only one child, because that conflicts with the longest common prefix property.
If we turn the trie shown in figure 5.9 into Patricia by compressing all nodes which have only one child, we get the Patricia prefix tree in figure 5.10.
Figure 5.10: A Patricia prefix tree, with keys: "a", "an", "another", "bool", "boy" and "zoo".
We can modify the definition of the alphabetic trie a bit to adapt it to Patricia. The Patricia is either empty, or a node in the form T = (v, C), where v is the optional satellite data and C = {(s1, T1), (s2, T2), ..., (sn, Tn)} is a list of pairs. Each pair contains a string si, which is bound to a sub-tree Ti.
The following Haskell example code defines Patricia accordingly.
type Key = String
data Patricia a = Patricia { value :: Maybe a
, children :: [(Key, Patricia a)]}
empty = Patricia Nothing []
The Python code below reuses the definition of trie to define Patricia.

class Patricia:
    def __init__(self, value = None):
        self.value = value
        self.children = {}
5.5.2 Insertion
When inserting a key s, if the Patricia is empty, we create a leaf node as shown in figure 5.11 (a). Otherwise, we need to check the children. If there is some sub-tree Ti bound to a string si, and there exists a common prefix between si and s, we need to branch out a new leaf Tj. The method is to create a new internal branch node, bind it with the common prefix, then set Ti as one child of this branch and Tj as the other. Ti and Tj share the common prefix. This is shown in figure 5.11 (b). However, there are two special cases: s may be a prefix of si, as shown in figure 5.11 (c), and si may be a prefix of s, as in figure 5.11 (d).
Figure 5.11: (a) Insert key "boy" into the empty Patricia; the result is a leaf. (b) Insert key "bool": a new branch with common prefix "bo" is created. (c) The new key is a prefix of an existing key. (d) Insert "another" into the node with prefix "an": we recursively insert key "other" into the child.

The insertion algorithm can be described as below.
1: function Insert(T, k, v)
2:    if T = NIL then
3:        T ← Empty-Node
4:    p ← T
5:    loop
6:        match ← FALSE
7:        for each (si, Ti) ∈ Children(p) do
8:            if k = si then
9:                Value(p) ← v
10:               return T
11:           c ← LCP(k, si)
12:           k1 ← k − c
13:           k2 ← si − c
14:           if c ≠ NIL then
15:               match ← TRUE
16:               if k2 = NIL then    ▷ si is prefix of k
17:                   p ← Ti
18:                   k ← k1
19:                   break
20:               else    ▷ Branch out a new leaf
21:                   Children(p) ← Children(p) ∪ { (c, Branch(k1, v, k2, Ti)) }
22:                   Delete(Children(p), (si, Ti))
23:                   return T
24:       if ¬match then    ▷ Add a new leaf
25:           Children(p) ← Children(p) ∪ { (k, Create-Leaf(v)) }
26:           return T
27:   return T
In the above algorithm, the LCP function finds the longest common prefix of the two given strings; for example, strings "bool" and "boy" have the longest common prefix "bo". The subtraction symbol '−' for strings gives the different part of two strings; for example "bool" − "bo" = "ol". The Branch function creates a branch node and updates the keys accordingly.
The longest common prefix can be extracted by checking the characters in the two strings one by one till there are two characters that don't match.
1: function LCP(A, B)
2:    i ← 1
3:    while i ≤ |A| ∧ i ≤ |B| ∧ A[i] = B[i] do
4:        i ← i + 1
5:    return A[1...i−1]
There are two cases when branching out a new leaf. Branch(s1, T1, s2, T2) takes two different keys and two trees. If s1 is empty, we are dealing with a case such as inserting key "an" into a child bound to string "another": we set T2 as the child of T1. Otherwise, we create a new branch node and set T1 and T2 as the two children.
1: function Branch(s1, T1, s2, T2)
2:    if s1 = ∅ then
3:        Children(T1) ← Children(T1) ∪ {(s2, T2)}
4:        return T1
5:    T ← Empty-Node
6:    Children(T) ← {(s1, T1), (s2, T2)}
7:    return T
The functions to find the longest common prefix and to branch out are implemented in Python as below.
# returns (p, s1', s2'), where p is the lcp, s1' = s1 - p, s2' = s2 - p
def lcp(s1, s2):
    j = 0
    while j < len(s1) and j < len(s2) and s1[j] == s2[j]:
        j += 1
    return (s1[0:j], s1[j:], s2[j:])

def branch(key1, tree1, key2, tree2):
    if key1 == "":
        # example: insert "an" into "another"
        tree1.children[key2] = tree2
        return tree1
    t = Patricia()
    t.children[key1] = tree1
    t.children[key2] = tree2
    return t
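With lcp and branch at hand, the imperative insertion itself can be sketched as below (an illustrative sketch; the function name patricia_insert is our own):

def patricia_insert(t, key, value = None):
    if t is None:
        t = Patricia()
    node = t
    while True:
        match = False
        for k, tr in node.children.items():
            if key == k:                      # duplicated key: overwrite
                tr.value = value
                return t
            (prefix, k1, k2) = lcp(key, k)
            if prefix != "":
                match = True
                if k2 == "":                  # k is a prefix of key: go deeper
                    node = tr
                    key = k1
                    break
                else:                         # branch out a new leaf
                    node.children[prefix] = branch(k1, Patricia(value), k2, tr)
                    del node.children[k]
                    return t
        if not match:                         # add a new leaf
            node.children[key] = Patricia(value)
            return t

For example, the Patricia of figure 5.10 can be built with:

t = None
for w in ["a", "an", "another", "bool", "boy", "zoo"]:
    t = patricia_insert(t, w, w.upper())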
The insertion can also be realized recursively. Starting from the root, the program checks all the children to find if there is a node that matches the key. Matching means they have a common prefix. For duplicated keys, the program overwrites the previous value. There are also alternative solutions to handle duplicated keys, such as using a linked list, etc. If there is no child matching the key, the program creates a new leaf and adds it to the children.
For Patricia T = (v, C), function insert(T, k, v′) inserts key k and value v′ into the tree.

insert(T, k, v′) = (v, ins(C, k, v′))
(5.12)

ins(C, k, v′) =
    {(k, (v′, ∅))} : C = ∅
    {(k, (v′, CT1))} ∪ C′ : k1 = k
    {branch(k, v′, k1, T1)} ∪ C′ : match(k1, k)
    {(k1, T1)} ∪ ins(C′, k, v′) : otherwise
(5.13)
The first clause deals with the edge case of empty children. A leaf node containing v′, bound to k, will be returned as the only child. The second clause overwrites the previous value with v′ if there is some child bound to the same key. Note that CT1 means the children of sub-tree T1. The third clause branches out a new leaf if the first child matches the key k. The last clause goes on checking the rest of the children.
We define two keys A and B as matching if they have a non-empty common prefix:

match(A, B) = A ≠ ∅ ∧ B ≠ ∅ ∧ a1 = b1
(5.14)

where a1 and b1 are the first characters in A and B if they are not empty.
Function branch(k1, v, k2, T2) takes two keys, a value and a tree. Extract the longest common prefix k = lcp(k1, k2), and denote the different parts as k1′ = k1 − k, k2′ = k2 − k. The algorithm first handles the edge cases that either k1 is a prefix of k2 or k2 is a prefix of k1. For the former case, it creates a new node containing v, binds this node to k, and sets (k2′, T2) as the only child; for the latter case, it recursively inserts k1′ and v into T2. Otherwise, the algorithm creates a branch node, binds it to the longest common prefix k, and sets two children for it: one child is (k2′, T2), the other is a leaf node containing v, bound to k1′.

branch(k1, v, k2, T2) =
    (k, (v, {(k2′, T2)})) : k = k1
    (k, insert(T2, k1′, v)) : k = k2
    (k, (∅, {(k1′, (v, ∅)), (k2′, T2)})) : otherwise
(5.15)

lcp(A, B) =
    ∅ : A = ∅ ∨ B = ∅ ∨ a1 ≠ b1
    {a1} ∪ lcp(A′, B′) : a1 = b1
(5.16)
5.5.3 Look up
When looking up a key, we can't examine the characters one by one as in trie. Starting from the root, we need to search among the children to see if any one is bound to a prefix of the key. If there is such a child, we update the key by removing the prefix part, and recursively look up the updated key in this child. If there aren't any children bound to any prefix of the key, the looking up fails.
1: function Look-Up(T, k)
2:    if T = NIL then
3:        return not found
4:    repeat
5:        match ← FALSE
6:        for (ki, Ti) ∈ Children(T) do
7:            if k = ki then
8:                return Data(Ti)
9:            if ki is prefix of k then
10:               match ← TRUE
11:               k ← k − ki
12:               T ← Ti
13:               break
14:   until ¬match
15:   return not found
The looking up can also be defined recursively. For Patricia T = (v, C):

lookup(T, k) = find(C, k)
(5.17)

If C is empty, the looking up fails; otherwise, for C = {(k1, T1), (k2, T2), ..., (kn, Tn)}, we first examine whether k equals k1, then whether k1 is a prefix of k; if not, we recursively check the rest of the pairs, denoted as C′:

find(C, k) =
    ∅ : C = ∅
    vT1 : k = k1
    lookup(T1, k − k1) : k1 ⊏ k
    find(C′, k) : otherwise
(5.18)

where vT1 is the value of T1, and k1 ⊏ k means k1 is a prefix of k.
5.6 Applications of trie and Patricia
Trie and Patricia can be used to solve some interesting problems. Integer-based prefix trees are used in compiler implementation. Some daily-used software applications have many interesting features which can be realized with trie or Patricia. In this section, some applications are given as examples, including e-dictionary, word auto-completion, and the T9 input method. Commercial implementations typically do not adopt trie or Patricia directly. The solutions we demonstrate here are for illustration purposes only.
5.6.1 E-dictionary and word auto-completion

Figure 5.12: E-dictionary. All candidates starting with what the user input are listed.
An e-dictionary typically contains hundreds of thousands of words. It's very expensive to perform a whole-word search. Commercial software adopts complex approaches, including caching, indexing, etc., to speed up this process.
Similar to the e-dictionary, figure 5.13 shows a popular Internet search engine. When the user inputs something, it provides a candidate list, with all items starting with what the user has entered. These candidates are shown in order of popularity: the more people search for a word, the higher its position in the list.
In both cases, the software provides a kind of word auto-completion mechanism. In some modern IDEs, the editor can even help users to auto-complete program code.
Let's see how to implement the e-dictionary with trie or Patricia. To simplify the problem, we assume the dictionary only supports English - English information.
128
Figure 5.13: A search engine. All candidates starting with what user input are
listed.
A dictionary stores key-value pairs, where the keys are English words or phrases, and the values are the meanings described in English sentences.

We can store all the words and their meanings in a trie, but it isn't space-efficient, especially when there is a huge amount of items. We'll use Patricia to realize the e-dictionary.

When the user wants to look up the word 'a', the dictionary does not only return the meaning of 'a', but also provides a list of candidate words which all start with 'a', including 'abandon', 'about', 'accent', 'adam', ... Of course all these words are stored in the Patricia.

If there are too many candidates, one solution is to display only the top 10 words, and let the user browse for more.

The following algorithm reuses the look up defined for Patricia. When it finds a node bound to a string which is the prefix of what we are looking for, it expands all its children until getting n candidates.
function Look-Up(T, k, n)
    if T = NIL then
        return ∅
    prefix ← NIL
    repeat
        match ← FALSE
        for each (ki, Ti) ∈ Children(T) do
            if k is prefix of ki then
                return Expand(Ti, prefix + ki, n)
            if ki is prefix of k then
                match ← TRUE
                k ← k − ki
                T ← Ti
                prefix ← prefix + ki
                break
    until ¬match
    return ∅
Where function Expand(T, prefix, n) picks n sub-trees, which share the same prefix, in T. It is realized as a BFS (breadth-first search) traversal. The chapter about searching explains BFS in detail.
function Expand(T, prefix, n)
    R ← ∅
    Q ← {(prefix, T)}
    while |R| < n ∧ |Q| > 0 do
        (k, T) ← Pop(Q)
        if Data(T) ≠ NIL then
            R ← R ∪ {(k, Data(T))}
        for each (ki, Ti) ∈ Children(T) do
            Push(Q, (k + ki, Ti))
    return R
The following example Python program implements the e-dictionary application. When testing if a string is a prefix of another one, it uses the find function provided in the standard string library.

import string

def patricia_lookup(t, key, n):
    if t is None:
        return None
    prefix = ""
    while True:
        match = False
        for k, tr in t.children.items():
            if string.find(k, key) == 0: # key is prefix of k
                return expand(prefix + k, tr, n)
            if string.find(key, k) == 0: # k is prefix of key
                match = True
                key = key[len(k):]
                t = tr
                prefix += k
                break
        if not match:
            return None
def expand(prefix, t, n):
    res = []
    q = [(prefix, t)]
    while len(res) < n and len(q) > 0:
        (s, p) = q.pop(0)
        if p.value is not None:
            res.append((s, p.value))
        for k, tr in p.children.items():
            q.append((s + k, tr))
    return res
This algorithm can also be implemented recursively: if the string we are looking for is empty, we expand all children until getting n candidates; otherwise we recursively examine the children to find the one which has a prefix equal to this string.

In programming environments supporting lazy evaluation, an intuitive solution is to expand all candidates, and take the first n on demand. Denote the Patricia prefix tree in the form $T = (v, C)$; the function below enumerates all items starting with key $k$.
\[
findAll(T, k) = \begin{cases}
enum(C) & k = \phi, v = \phi \\
\{(\phi, v)\} \cup enum(C) & k = \phi, v \neq \phi \\
find(C, k) & k \neq \phi
\end{cases} \tag{5.19}
\]
The first two clauses deal with the edge cases where the key is empty: all the children are enumerated, and the value $v$ is included if it is not empty. The last clause finds the child matching $k$.

For non-empty children, $C = \{(k_1, T_1), (k_2, T_2), ..., (k_m, T_m)\}$, denote the rest of the pairs except for the first one as $C'$. The enumeration algorithm can be defined as below.
\[
enum(C) = \begin{cases}
\phi & C = \phi \\
mapAppend(k_1, findAll(T_1, \phi)) \cup enum(C') & \text{otherwise}
\end{cases} \tag{5.20}
\]

Where $mapAppend(k, L) = \{(k + k_i, v_i) | (k_i, v_i) \in L\}$. It concatenates the prefix $k$ in front of every key-value pair in list $L$.

Function $find(C, k)$ is defined as the following. For empty children, the result is empty as well; otherwise, it examines the first child $T_1$, which is bound to string $k_1$. If $k_1$ is equal to $k$, it calls $mapAppend$ to add the prefix to the keys of all the children under $T_1$; if $k_1$ is a prefix of $k$, the algorithm recursively finds all children starting with $k - k_1$; on the other hand, if $k$ is a prefix of $k_1$, all children under $T_1$ are valid results. Otherwise, the algorithm bypasses the first child and goes on finding among the rest of the children.
\[
find(C, k) = \begin{cases}
\phi & C = \phi \\
mapAppend(k, findAll(T_1, \phi)) & k_1 = k \\
mapAppend(k_1, findAll(T_1, k - k_1)) & k_1 \sqsubset k \\
findAll(T_1, \phi) & k \sqsubset k_1 \\
find(C', k) & \text{otherwise}
\end{cases} \tag{5.21}
\]
The below example Haskell program implements the e-dictionary application according to the above equations.

-- assumes: import Data.List (isPrefixOf)
findAll :: Patricia a -> Key -> [(Key, a)]
findAll t [] =
    case value t of
      Nothing -> enum $ children t
      Just x  -> ("", x) : (enum $ children t)
  where
    enum [] = []
    enum (p:ps) = (mapAppend (fst p) (findAll (snd p) [])) ++ (enum ps)
findAll t k = find (children t) k where
    find [] _ = []
    find (p:ps) k
        | (fst p) == k = mapAppend k (findAll (snd p) [])
        | (fst p) `isPrefixOf` k = mapAppend (fst p) (findAll (snd p) (k `diff` (fst p)))
        | k `isPrefixOf` (fst p) = findAll (snd p) []
        | otherwise = find ps k
    diff x y = drop (length y) x

mapAppend s lst = map (\p -> (s ++ (fst p), snd p)) lst
In the lazy evaluation environment, the top n candidates can be obtained like take(n, findAll(T, k)). Appendix A has the detailed definition of the take function.

5.6.2 T9 input method
Most mobile phones around the year 2000 were equipped with a key pad, so users have quite a different experience from a PC when editing a short message or email. This is because the mobile-phone key pad, the so-called ITU-T key pad, has much fewer keys than a PC keyboard. Figure 5.14 shows one example.

There are two typical ways to input words with such a key pad. One is to press a key repeatedly to select one character; for instance, pressing key '4' once, key '6' three times, key '6' three times again, and key '3' once inputs the word 'good'. The other is to press the key sequence only once, '4', '6', '6', '3', and let the software list all the candidate words mapped to this digit sequence.

Comparing these two methods, we can see the second one is much easier for the end user. The only overhead is to store a dictionary of candidate words.

Method 2 is called the T9 input method, or predictive input method [6], [7]. The abbreviation T9 stands for 'text on 9 keys'. T9 input can also be realized with trie or Patricia.
In order to provide candidate words to the user, a dictionary must be prepared in advance. Trie or Patricia can be used to store the dictionary. The commercial T9 implementations typically use complex indexing dictionaries in both the file system and cache. The realization shown here is for illustration purpose only.

Firstly, we need to define the T9 mapping, which maps from a digit to its candidate characters.
\[
M_{T9} = \{2 \to abc,\ 3 \to def,\ 4 \to ghi,\ 5 \to jkl,\ 6 \to mno,\ 7 \to pqrs,\ 8 \to tuv,\ 9 \to wxyz\} \tag{5.22}
\]

With this mapping, $M_{T9}[i]$ returns the corresponding characters for digit $i$.

Suppose the user inputs digits $D = d_1 d_2 ... d_n$. If $D$ isn't empty, denote the rest of the digits except for $d_1$ as $D'$. The below pseudo code shows how to realize T9 with a trie.
function Look-Up-T9(T, D)
    Q ← {(∅, D, T)}
    R ← ∅
    while Q ≠ ∅ do
        (prefix, D, T) ← Pop(Q)
        for each c in MT9[d1] do
            if c ∈ Children(T) then
                if D′ = ∅ then
                    R ← R ∪ {prefix + c}
                else
                    Push(Q, (prefix + c, D′, Children(T)[c]))
    return R
Where prefix + c means appending character c to the end of string prefix. Again, this algorithm performs BFS with a queue Q. The queue is initialized with a tuple (prefix, D, T), containing the empty prefix, the digit sequence to be searched, and the trie. It keeps picking tuples from the queue as long as it isn't empty, then gets the candidate characters for the first digit to be processed via the T9 map. For each character c, if there is a sub-tree bound to it, we create a new tuple, update the prefix by appending c, use the rest of the digits to update D, and use that sub-tree. This new tuple is pushed back to the queue for further searching. If all the digits are processed, it means a candidate word is found, and we put this word to the result list R.

The following example program in Python implements this T9 search with a trie.

T9MAP = {'2':"abc", '3':"def", '4':"ghi", '5':"jkl", \
         '6':"mno", '7':"pqrs", '8':"tuv", '9':"wxyz"}
Because trie is not space-efficient, we can modify the above algorithm into a Patricia solution. As long as the queue isn't empty, the algorithm pops a tuple. This time, we examine all the prefix-subtree pairs. For every pair (ki, Ti), we convert the alphabetic prefix ki back to a digit sequence D′ by looking up the T9 map. If D′ exactly matches the digits the user input, we find a candidate word; otherwise, if the digit sequence is a prefix of what the user inputs, the program creates a new tuple, updates the prefix, the digits to be processed, and the sub-tree, then puts the tuple back to the queue for further search.
function Look-Up-T9(T, D)
    Q ← {(∅, D, T)}
    R ← ∅
    while Q ≠ ∅ do
        (prefix, D, T) ← Pop(Q)
        for each (ki, Ti) ∈ Children(T) do
            D′ ← Convert-T9(ki)
            if D′ ⊑ D then                    ▷ D′ is prefix of D
                if D′ = D then
                    R ← R ∪ {prefix + ki}
                else
                    Push(Q, (prefix + ki, D − D′, Ti))
    return R
Function Convert-T9(K) converts each character in K back to a digit.

function Convert-T9(K)
    D ← ∅
    for each c ∈ K do
        for each (d → S) ∈ MT9 do
            if c ∈ S then
                D ← D + {d}
                break
    return D
The following example Python program implements the T9 input method
with Patricia.
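A minimal sketch of this Patricia version, reusing T9MAP; the helper convert_t9, which maps an alphabetic prefix back to its digit sequence, and the function names are chosen for illustration:

def convert_t9(word):
    return ''.join(d for c in word for (d, s) in T9MAP.items() if c in s)

def lookup_t9(t, digits):
    if t is None or digits == "":
        return []
    res = []
    q = [("", digits, t)]
    while q != []:
        (prefix, ds, t) = q.pop(0)
        for k, tr in t.children.items():
            ds1 = convert_t9(k)
            if ds.startswith(ds1):           # D' is prefix of D
                if ds == ds1:
                    res.append(prefix + k)
                else:
                    q.append((prefix + k, ds[len(ds1):], tr))
    return res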
The T9 input method can also be realized recursively. Let's first define the trie solution. The algorithm takes two arguments: a trie storing all the candidate words, and a sequence of digits input by the user. If the sequence is empty, the result is empty as well; otherwise, it looks up C to find those children bound to the characters of the first digit $d_1$ according to the T9 map.

\[
findT9(T, D) = \begin{cases}
\{\phi\} & D = \phi \\
fold(f, \phi, lookupT9(d_1, C)) & \text{otherwise}
\end{cases} \tag{5.23}
\]

Where function $f$ appends the candidate character in front of every word found by the recursive search, and accumulates the results.

\[ f(L, (k, T')) = mapAppend(k, findT9(T', D')) \cup L \tag{5.24} \]
Note this mapAppend function is a bit different from the previous one defined in the e-dictionary application: the first argument is a character rather than a string.

Function $lookupT9(k, C)$ checks all the possible characters mapped to digit $k$. If a character is bound to some child in $C$, it is recorded as one candidate.

\[ lookupT9(k, C) = fold(g, \phi, M_{T9}[k]) \tag{5.25} \]

Where

\[
g(L, c) = \begin{cases}
L & find(C, c) = \phi \\
\{(c, T')\} \cup L & find(C, c) = T'
\end{cases} \tag{5.26}
\]
There are a few modifications when changing the realization from trie to Patricia. Firstly, each sub-tree is bound to a prefix string rather than a single character.

\[
findT9(T, D) = \begin{cases}
\{\phi\} & D = \phi \\
fold(f, \phi, findPrefixT9(D, C)) & \text{otherwise}
\end{cases} \tag{5.27}
\]
The list for folding is given by calling function $findPrefixT9(D, C)$. And $f$ is also modified to reflect this change: it appends the candidate prefix $D'$ in front of every result output by the recursive search, and then accumulates the words.

\[ f(L, (D', T')) = mapAppend(D', findT9(T', D - D')) \cup L \tag{5.28} \]

Function $findPrefixT9(D, C)$ examines all the children. For every pair $(k_i, T_i)$, if converting $k_i$ back to digits yields a prefix of $D$, then this pair is selected as a candidate.

\[ findPrefixT9(D, C) = \{(k_i, T_i) | (k_i, T_i) \in C, convertT9(k_i) \sqsubseteq D\} \tag{5.29} \]

Function $convertT9(K)$ converts every alphabetic character in $K$ back to digits according to the T9 map.

\[ convertT9(K) = \{d | c \in K, (d \to S) \in M_{T9} \wedge c \in S\} \tag{5.30} \]
Exercise 5.2

For the T9 input, compare the results of the algorithms realized with trie and Patricia; the sequences are different. Why does this happen? How can the algorithms be modified so that they output the candidates in the same sequence?
5.7 Summary

In this chapter, we started from the integer-based trie and Patricia. The map data structure based on integer Patricia plays an important role in compiler implementation. Alphabetic trie and Patricia are natural extensions, and can be used to manipulate text information. As examples, a predictive e-dictionary and the T9 input method were realized with trie or Patricia. Although these examples are different from the real implementations in commercial software, they show simple approaches to solve some problems. Other important data structures, such as the suffix tree, have a close relationship with them. The suffix tree is introduced in the next chapter.
Bibliography

[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. Introduction to Algorithms, Second Edition. Problem 12-1. ISBN:0262032937. The MIT Press. 2001

[2] Chris Okasaki and Andrew Gill. Fast Mergeable Integer Maps. Workshop on ML, September 1998, pages 77-86. [Link]

[3] D.R. Morrison, PATRICIA -- Practical Algorithm To Retrieve Information Coded In Alphanumeric, Journal of the ACM, 15(4), October 1968, pages 514-534.

[4] Suffix Tree, Wikipedia. [Link]

[5] Trie, Wikipedia. [Link]

[6] T9 (predictive text), Wikipedia. [Link]

[7] Predictive text, Wikipedia. [Link]
Chapter 6

Suffix Tree

6.1 Introduction
Figure 6.1: The suffix tree of 'bananas'.

A straightforward way to build the suffix tree is to insert all the suffixes of the string one by one:

\[ suffixTree(S) = fold(insert, \phi, suffixes(S)) \tag{6.1} \]
Where function $suffixes(S)$ gives all the suffixes of string $S$. If the string is empty, the result is one empty string; otherwise, $S$ itself is one suffix, and the others can be given by recursively calling $suffixes(S')$, where $S'$ is given by dropping the first character from $S$.
\[
suffixes(S) = \begin{cases}
\{\phi\} & S = \phi \\
\{S\} \cup suffixes(S') & \text{otherwise}
\end{cases} \tag{6.2}
\]
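For instance, a one-line Python realization of suffixes, including the empty string:

def suffixes(s):
    return [s[i:] for i in range(len(s) + 1)]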
This solution constructs the suffix tree in $O(n^2)$ time for a string of length n: it inserts n suffixes in total, and each insertion takes linear time proportional to the length of the suffix. The efficiency isn't good enough.

In this chapter, we first explain a fast on-line suffix trie construction solution by using the suffix link concept. Because trie isn't space efficient, we then introduce a linear time on-line suffix tree construction algorithm found by Ukkonen, and show how to solve some interesting string manipulation problems with the suffix tree.
6.2 Suffix trie
Just like the relationship between trie and Patricia, the suffix trie has a much simpler structure than the suffix tree. Figure 6.3 shows the suffix trie for 'banana'.

Comparing with figure 6.1, we can find the difference between suffix tree and suffix trie: instead of representing a word, every edge in the suffix trie only represents a character. Thus the suffix trie needs much more space. If we pack all nodes which have only one child, the suffix trie is turned into a suffix tree.

We can reuse the trie definition for the suffix trie. Each node is bound to a character, and contains multiple sub-trees as children. A child can be referred to by the bound character.
6.2.1 Node transfer and suffix link
For a string $S$ of length n, define $S_i = s_1 s_2 ... s_i$, the prefix containing the first i characters.

In the suffix trie, each node represents a suffix string. For example in figure 6.4, node X represents suffix 'a'; by adding character 'c', node X transfers to Y, which represents suffix 'ac'. We say node X transfers to Y with the edge of character 'c' [1].
Y ← Children(X)[c]

We also say that node X has a c-child Y. The below Python expression reflects this concept.

y = x.children[c]
If node A in a suffix trie represents suffix $s_i s_{i+1} ... s_n$, and node B represents suffix $s_{i+1} s_{i+2} ... s_n$, we say node B represents the suffix of node A. We can create a link from A to B; this link is defined as the suffix link of node A [1]. Suffix links are drawn in dotted style. In figure 6.4, the suffix link of node A points to node B, and the suffix link of node B points to node C.

The suffix link is valid for all nodes except the root. We can add a suffix link field to the trie definition. The below Python example code shows this update.
class STrie:
    def __init__(self, suffix=None):
        self.children = {}
        self.suffix = suffix
Figure 6.4: Suffix trie for string 'cacao'. Node X ↔ 'a', node Y ↔ 'ac'; X transfers to Y with character 'c'.

Table 6.1 lists the suffixes of $S_i$, from the longest to the shortest:

suffix string
s1 s2 s3 ... si
s2 s3 ... si
...
s(i-1) si
si
"" (the empty string)
6.2.2 On-line construction
For string $S$, suppose we have constructed the suffix trie for the i-th prefix $S_i = s_1 s_2 ... s_i$, denoted as $SuffixTrie(S_i)$. Let's consider how to obtain $SuffixTrie(S_{i+1})$ from $SuffixTrie(S_i)$.

If we list all suffixes corresponding to $SuffixTrie(S_i)$, from the longest (which is $S_i$) to the shortest (which is empty), we get table 6.1. There are $i + 1$ suffixes in total.

One solution is to append the character $s_{i+1}$ to every suffix in this table, and then add another empty string. This idea can be realized by adding a new child for every node in the trie, and binding all these new children with edges of character $s_{i+1}$.
However, some nodes in $SuffixTrie(S_i)$ may already have the $s_{i+1}$-child. For example, in figure 6.5, nodes X and Y correspond to suffixes 'cac' and 'ac' respectively. They don't have the 'a'-child, but node Z, which represents suffix 'c', has the 'a'-child already.
Algorithm 1: Update SuffixTrie(Si) to SuffixTrie(Si+1), initial version.

for each T ∈ SuffixTrie(Si) do
    Children(T)[si+1] ← Create-Empty-Node

Figure 6.5: The suffix trie of 'caca'; nodes X, Y, Z correspond to suffixes 'cac', 'ac', and 'c'.
When appending $s_{i+1}$ to $SuffixTrie(S_i)$ (in this example $s_{i+1}$ is character 'a'), we need to create new nodes for X and Y, but we needn't do this for Z.

If we check the nodes one by one according to table 6.1, we can stop immediately when we meet a node which already has the $s_{i+1}$-child. This is because if node X in $SuffixTrie(S_i)$ has the $s_{i+1}$-child, then according to the definition of the suffix link, any suffix node X' of X in $SuffixTrie(S_i)$ must also have the $s_{i+1}$-child. In other words, let $c = s_{i+1}$; if $wc$ is a sub-string of $S_i$, then every suffix of $wc$ is also a sub-string of $S_i$ [1]. The only exception is the root, which represents the empty string.

According to this fact, we can refine algorithm 1 to the following.
Algorithm 2: Update SuffixTrie(Si) to SuffixTrie(Si+1), second version.

for each T ∈ SuffixTrie(Si) in descending order of suffix length do
    if Children(T)[si+1] = NIL then
        Children(T)[si+1] ← Create-Empty-Node
    else
        break
The next question is how to iterate all nodes in descending order of the suffix length. Define the top of a suffix trie as the deepest leaf node; this definition ensures the top represents the longest suffix. Along the suffix link from the top to the next node, the length of the suffix decreases by one. This fact tells us that we can traverse the suffix trie from the top to the root by using the suffix links, and the order of such a traversal is exactly what we want. Finally, there is a special suffix trie for the empty string, SuffixTrie(NIL); we define the top to be equal to the root in this case.
function Insert(top, c)
    if top = NIL then                          ▷ The trie is empty
        top ← Create-Empty-Node
    T ← top
    T′ ← Create-Empty-Node                     ▷ dummy init value
    while T ≠ NIL ∧ Children(T)[c] = NIL do
        Children(T)[c] ← Create-Empty-Node
        Suffix-Link(T′) ← Children(T)[c]
        T′ ← Children(T)[c]
        T ← Suffix-Link(T)
    if T ≠ NIL then
        Suffix-Link(T′) ← Children(T)[c]
    return Children(top)[c]                    ▷ returns the new top
Function Insert updates $SuffixTrie(S_i)$ to $SuffixTrie(S_{i+1})$. It takes two arguments: one is the top of $SuffixTrie(S_i)$, the other is the character $s_{i+1}$. If the top is NIL, it means the tree is empty and there is no root; the algorithm creates a root node in this case. A sentinel empty node T′ is created, which keeps tracking the previously created new node. In the main loop, the algorithm checks every node one by one along the suffix links. If the node hasn't the $s_{i+1}$-child, it creates a new node and binds the edge to character $s_{i+1}$. The algorithm repeatedly goes up along the suffix links until it either arrives at the root, or finds a node which already has the $s_{i+1}$-child. After the loop, if the node isn't NIL, it means we stopped at a node which has the $s_{i+1}$-child; the last suffix link then points to that child. Finally, the new top position is returned, so that we can further insert other characters to the suffix trie.
For a given string S, the suffix trie can be built by repeatedly calling the Insert function.

function Suffix-Trie(S)
    t ← NIL
    for i ← 1 to |S| do
        t ← Insert(t, si)
    return t

This algorithm returns the top of the suffix trie, not the root. In order to access the root, we can traverse along the suffix links.

function Root(T)
    while Suffix-Link(T) ≠ NIL do
        T ← Suffix-Link(T)
    return T
Figure 6.6 shows the steps of constructing the suffix trie for 'cacao'. Only the last layer of suffix links is shown.

For the Insert algorithm, the computation time is proportional to the size of the suffix trie. In the worst case, the suffix trie is built in $O(n^2)$ time, where $n = |S|$. One example is $S = a^n b^n$: there are n characters of 'a' followed by n characters of 'b'.

The following example Python program implements the suffix trie construction algorithm.
def suffix_trie(str):
    t = None
    for c in str:
        t = insert(t, c)
    return root(t)

def insert(top, c):
    if top is None:
        top = STrie()
    node = top
    new_node = STrie() # dummy init value
    while (node is not None) and (c not in node.children):
        new_node.suffix = node.children[c] = STrie(node)
        new_node = node.children[c]
        node = node.suffix
    if node is not None:
        new_node.suffix = node.children[c]
    return top.children[c] # update top

def root(node):
    while node.suffix is not None:
        node = node.suffix
    return node
Figure 6.6: Construct the suffix trie for 'cacao' in 6 steps: (a) empty, (b) 'c', (c) 'ca', (d) 'cac', (e) 'caca', (f) 'cacao'. Only the last layer of suffix links is shown in dotted arrows.
6.3 Suffix Tree

The suffix trie isn't space efficient, and the construction time is quadratic. If we don't care about the speed, we can compress the suffix trie to a suffix tree [6]. Ukkonen found a linear time on-line suffix tree construction algorithm in 1995.

6.3.1 On-line construction
Figure 6.7: The suffix tree of 'bananas'.

In figure 6.7, node X represents the suffix 'a'; by adding the sub-string 'na', node Y represents the suffix 'ana'. In other words, we can represent Y with a pair of a node and a sub-string, like $(X, w)$, where $w =$ 'na'. Ukkonen defines such a pair as a reference pair. Not only the explicit nodes, but also the implicit positions in the suffix tree can be represented with reference pairs. For example, $(X, \text{'n'})$ represents a position which is not an explicit node. By using reference pairs, we can represent every position in a suffix tree.

In order to save space, for string $S$, all sub-strings can be represented as a pair of indexes $(l, r)$, where $l$ is the left index and $r$ is the right index of the characters of the sub-string. For instance, if $S =$ 'bananas', and the index starts from 1, the sub-string 'na' can be represented with the pair $(3, 4)$. As the result, there is only one copy of the complete string, and every position in the suffix tree is defined as $(node, (l, r))$. This is the final form of the reference pair.

With reference pairs, the node transfer for the suffix tree can be defined as the following.
With reference pair, node transfer for suffix tree can be defined as the following.
Children(X)[sl ] ((l, r), Y ) Y (X, (l, r))
If character sl = c, we say that node X has a c-child, This child is Y . Y can
be transferred from X with sub string (l, r) Each node can have at most one
c-child.
Canonical reference pair

It's obvious that one position in a suffix tree may have multiple reference pairs. For example, node Y in figure 6.7 can be denoted either as $(X, (3, 4))$ or $(root, (2, 4))$. If we define the empty string $\epsilon = (i, i - 1)$, Y can also be represented as $(Y, \epsilon)$.

The canonical reference pair is the one which has the closest node to the position. Specially, in case the position is an explicit node, the canonical reference pair is $(node, \epsilon)$, so $(Y, \epsilon)$ is the canonical reference pair of node Y.

The below algorithm converts a reference pair $(node, (l, r))$ to the canonical reference pair $(node', (l', r))$. Note that since $r$ doesn't change, the algorithm only returns $(node', l')$ as the result.
Algorithm 3: Convert a reference pair to the canonical reference pair

function Canonize(node, (l, r))
    if node = NIL then
        if (l, r) = ε then
            return (NIL, l)
        else
            return Canonize(root, (l + 1, r))
    while l ≤ r do                             ▷ (l, r) isn't empty
        ((l′, r′), node′) ← Children(node)[sl]
        if r − l ≥ r′ − l′ then
            l ← l + r′ − l′ + 1                ▷ remove |(l′, r′)| chars from (l, r)
            node ← node′
        else
            break
    return (node, l)
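A Python sketch of Canonize, under the representation used by the STree class later in this section (node.children maps a character to a ((l, r), node) pair; 0-based indices into t.str are assumed):

def canonize(t, node, l, r):
    if node is None:
        if l > r:                   # (l, r) is empty
            return (None, l)
        return canonize(t, t.root, l + 1, r)
    while l <= r:                   # (l, r) isn't empty
        # the child along t.str[l] is assumed to exist here
        ((l1, r1), node1) = node.children[t.str[l]]
        if r - l >= r1 - l1:        # remove |(l1, r1)| chars from (l, r)
            l = l + r1 - l1 + 1
            node = node1
        else:
            break
    return (node, l)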
If the passed-in node parameter is NIL, it means a very special case: if $(l, r)$ is empty, the position is kept as it is; otherwise, the algorithm moves one character ahead and continues canonizing from the root.
Ukkonen found a very important fact: if $(node, (l, i))$ is the end point of $SuffixTree(S_i)$, then $(node, (l, i + 1))$ is the active point of $SuffixTree(S_{i+1})$. This is because if $(node, (l, i))$ is the end point of $SuffixTree(S_i)$, it must have an $s_{i+1}$-child (either explicitly or implicitly). If this end point represents the suffix $s_k s_{k+1} ... s_i$, it is the longest suffix in $SuffixTree(S_i)$ which satisfies that $s_k s_{k+1} ... s_i s_{i+1}$ is a sub-string of $S_i$. Considering $S_{i+1}$, $s_k s_{k+1} ... s_i s_{i+1}$ must occur at least twice in $S_{i+1}$, so position $(node, (l, i + 1))$ is the active point of $SuffixTree(S_{i+1})$. Figure 6.9 illustrates this fact.
Figure 6.9: The end point in SuffixTree(Si) and the active point in SuffixTree(Si+1).
Because there is only one copy of the complete string, all sub-strings are represented as (left, right) index pairs, and the leaves are open pairs (left, ∞). The suffix tree is defined like below.
Figure 6.10: Construct the suffix tree for 'cacao' in 6 steps: (a) empty, (b) 'c', (c) 'ca', (d) 'cac', (e) 'caca', (f) 'cacao'. Only the last layer of suffix links is shown in dotted arrows.
class STree:
    def __init__(self, s):
        self.str = s
        self.infinity = len(s) + 1000
        self.root = Node()
The infinity is defined as the length of the string plus a big number. Some
auxiliary functions are defined.
def substr(str, str_ref):
(l, r)=str_ref
return str[l:r+1]
def length(str_ref):
(l, r)=str_ref
return r-l+1
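The Node type referenced by STree isn't listed here; a minimal form consistent with the reference pair representation could be the following sketch, where each child is stored as a ((l, r), node) pair and suffix holds the suffix link:

class Node:
    def __init__(self, suffix=None):
        self.children = {}    # 'c' -> ((l, r), node)
        self.suffix = suffix  # the suffix link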
The edge function extracts a common prefix from a list of strings. The prefix returned by the edge function may not be the longest one; the empty string is also allowed. The exact behavior can be customized with different edge functions.

build(edge, X)

This defines a generic radix tree building function. It takes an edge function and a set of strings. X can be all suffixes of a string, so that we get a suffix trie or a suffix tree. We'll also explain later that X can be all prefixes, which leads to the normal prefix trie or Patricia.
Suppose all the strings are built from a character set Σ. When building the tree, if the string is empty, X only contains one empty string as well, and the resulting tree is an empty leaf. Otherwise, we examine every character in Σ, group the strings in X by their initial characters, and apply the edge function to these groups.
\[
build(edge, X) = \begin{cases}
leaf & X = \{\phi\} \\
branch(\{(\{c\} \cup p, build(edge, X')) \mid c \in \Sigma, G \in \{group(X, c)\}, (p, X') \in \{edge(G)\}\}) & \text{otherwise}
\end{cases} \tag{6.3}
\]

The algorithm categorizes all strings by their first letters into several groups, and removes the first letter from each element in every group. For example, the suffixes {'acac', 'cac', 'ac', 'c'} are categorized into the groups {(a, ['cac', 'c']), (c, ['ac', ''])}.

\[ group(X, c) = \{C' \mid \{c_1\} \cup C' \in X, c_1 = c\} \tag{6.4} \]

Function group enumerates all strings in X; for each one, denote the first character as $c_1$ and the rest of the characters as $C'$. If $c_1$ is the same as the given character c, then $C'$ is collected.
The below example Haskell program implements the generic radix tree building algorithm.

alpha = ['a'..'z'] ++ ['A'..'Z']

lazyTree :: EdgeFunc -> [String] -> Tr
lazyTree edge = build where
    build [[]] = Lf
    build ss = Br [(a:prefix, build ss') |
                       a <- alpha,
                       xs@(x:_) <- [[cs | c:cs <- ss, c == a]],
                       (prefix, ss') <- [edge xs]]
Different edge functions produce different radix trees. Since the edge function extracts a common prefix from a set of strings, the simplest one constantly uses the empty string as the common prefix. This edge function builds a trie.

\[ edgeTrie(X) = (\phi, X) \tag{6.5} \]

We can also realize an edge function that extracts the longest common prefix; such an edge function builds a Patricia. Denote the strings as $X = \{x_1, x_2, ..., x_n\}$; for each string $x_i$, let the initial character be $c_i$, and the rest of the characters in $x_i$ be $W_i$. If there is only one string in X, the longest common prefix is definitely this string; if there are two strings starting with different initial characters, the longest common prefix is empty; otherwise, all the strings share the same initial character, which definitely belongs to the longest common prefix. We can remove it from all strings, and recursively call the edge function.

\[
edgeTree(X) = \begin{cases}
(x_1, \{\phi\}) & X = \{x_1\} \\
(\phi, X) & \exists x_i, x_j \in X, c_i \neq c_j \\
(\{c_1\} \cup p, Y) & (p, Y) = edgeTree(\{W_i \mid x_i \in X\})
\end{cases} \tag{6.6}
\]
For any given string, we can build the suffix trie and the suffix tree by feeding suffixes to these two edge functions, and the normal prefix trie and Patricia by feeding prefixes.

\[ suffixTrie(S) = build(edgeTrie, suffixes(S)) \tag{6.7} \]
\[ suffixTree(S) = build(edgeTree, suffixes(S)) \tag{6.8} \]
\[ trie(S) = build(edgeTrie, prefixes(S)) \tag{6.9} \]
\[ patricia(S) = build(edgeTree, prefixes(S)) \tag{6.10} \]

6.4 Suffix tree applications
Suffix tree can help to solve many string and DNA pattern manipulation problems particularly fast.
6.4.1 String/Pattern searching
There are plenty of string searching algorithms, such as the famous KMP (Knuth-Morris-Pratt, introduced in the chapter about searching) algorithm. The suffix tree can perform at the same level as KMP [11]: the string searching is bound to O(m) time, where m is the length of the sub-string to be searched. However, O(n) time is required to build the suffix tree in advance, where n is the length of the text [12].

Not only sub-string searching, but also pattern matching, including regular expression matching, can be solved with the suffix tree. Ukkonen summarizes this kind of problems as sub-string motifs: for a string S, SuffixTree(S) gives the complete occurrence counts of all sub-string motifs of S in O(n) time, although S may have $O(n^2)$ sub-strings.
Find the number of sub-string occurrences

Every internal node in SuffixTree(S) corresponds to a sub-string which occurs multiple times in S. If this sub-string occurs k times in S, then there are k sub-trees under this node [13].
function Lookup-Pattern(T, s)
    loop
        match ← FALSE
        for each (si, Ti) ∈ Values(Children(T)) do
            if s ⊏ si then
                return Max(|Children(Ti)|, 1)
            else if si ⊏ s then
                match ← TRUE
                T ← Ti
                s ← s − si
                break
        if ¬match then
            return 0
When looking up a sub-string pattern s in a text w, we build the suffix tree T from the text. Starting from the root, we iterate over all children. For every string reference pair si and sub-tree Ti, we check if s is a prefix of si. If yes, the number of sub-trees in Ti is returned as the result. There is a special case that Ti is a leaf without any children; we need to return 1 instead of zero, which is why the maximum function is used. Otherwise, if si is a prefix of s, we remove the si part from s and recursively look up in Ti.

The following Python program implements this algorithm.
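A sketch following the pseudo code, using the substr helper and the children layout of the STree class:

def lookup_pattern(t, s):
    node = t.root
    while True:
        match = False
        for _, (str_ref, tr) in node.children.items():
            edge = substr(t.str, str_ref)
            if edge.startswith(s):       # s is prefix of the edge
                return max(len(tr.children), 1)
            elif s.startswith(edge):     # the edge is prefix of s
                match = True
                node = tr
                s = s[len(edge):]
                break
        if not match:
            return 0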
This algorithm can also be realized in a recursive way. For a non-leaf suffix tree T, denote the children as $C = \{(s_1, T_1), (s_2, T_2), ...\}$. We search the sub-string among the children.

\[ lookup_{pattern}(T, s) = find(C, s) \tag{6.11} \]

If the children C are empty, it means the sub-string doesn't occur at all. Otherwise, we examine the first pair $(s_1, T_1)$: if s is a prefix of $s_1$, then the number of sub-trees in $T_1$ is the result; if $s_1$ is a prefix of s, we remove $s_1$ from s and recursively look it up in $T_1$; otherwise, we go on to examine the rest of the children, denoted as $C'$.

\[
find(C, s) = \begin{cases}
0 & C = \phi \\
\max(1, |C_{T_1}|) & s \sqsubset s_1 \\
lookup_{pattern}(T_1, s - s_1) & s_1 \sqsubset s \\
find(C', s) & \text{otherwise}
\end{cases} \tag{6.12}
\]
We always append a special terminator to the string (the '$' in the above program), so that no suffix becomes the prefix of another [3].

The suffix tree also supports searching patterns like 'a**n'; we skip it here. Readers can refer to [13] and [14] for details.
6.4.2 Find the longest repeated sub-string

Figure 6.11: The suffix tree of 'mississippi$'.

After adding a special terminator '$' and building the suffix tree, the longest repeated sub-string corresponds to the deepest branch node in the tree. We can traverse all the branch nodes with BFS, keeping the longest candidates found so far in a list L, which is updated with function Update(L, s).
function Update(L, s)
    if L = ∅ ∨ |l1| < |s| then
        return {s}
    if |l1| = |s| then
        return Append(L, s)
    return L

The above algorithm can be implemented in Python as the following example program.
def lrs(t):
    queue = [("", t.root)]
    res = []
    while len(queue) > 0:
        (s, node) = queue.pop(0)
        for _, (str_ref, tr) in node.children.items():
            if len(tr.children) > 0: # branch node
                s1 = s + substr(t.str, str_ref)
                queue.append((s1, tr))
                res = update_max(res, s1)
    return res

def update_max(lst, x):
    if lst == [] or len(lst[0]) < len(x):
        return [x]
    if len(lst[0]) == len(x):
        return lst + [x]
    return lst
Searching the deepest branch can also be realized recursively. If the tree is just a leaf node, the empty string is returned; otherwise the algorithm tries to find the longest repeated sub-string from the children.

\[
LRS(T) = \begin{cases}
\phi & leaf(T) \\
longest(\{s_i \cdot LRS(T_i) \mid (s_i, T_i) \in C, \neg leaf(T_i)\}) & \text{otherwise}
\end{cases} \tag{6.13}
\]

The following Haskell example program implements the longest repeated sub-string algorithm.
-- assumes: import Data.List (maximumBy); import Data.Function (on)
isLeaf Lf = True
isLeaf _  = False

lrs :: Tr -> String
lrs Lf = ""
lrs (Br lst) = find $ filter (not . isLeaf . snd) lst where
    find [] = ""
    find ((s, t):xs) = maximumBy (compare `on` length) [s ++ (lrs t), find xs]
6.4.3 Find the longest common sub-string

The longest common sub-string of two strings can also be quickly found with the suffix tree. The solution is to build a generalized suffix tree: if the two strings are denoted as txt1 and txt2, a generalized suffix tree is SuffixTree(txt1 $1 txt2 $2), where $1 and $2 are two special terminators which do not appear in either string (the program below uses '#' and '$').

A sub-string ending at a branch node belongs to both texts if the node forks into two leaves, one from each text. Following the same BFS approach as in the longest repeated sub-string program, the candidates can be collected as below; function match_fork tests the forking criteria.

def lcs(t):
    queue = [("", t.root)]
    res = []
    while len(queue) > 0:
        (s, node) = queue.pop(0)
        if match_fork(t, node):
            res = update_max(res, s)
        for _, (str_ref, tr) in node.children.items():
            if len(tr.children) > 0:
                s1 = s + substr(t.str, str_ref)
                queue.append((s1, tr))
    return res
def is_leaf(node):
    return node.children == {}

def match_fork(t, node):
    if len(node.children) == 2:
        [(_, (str_ref1, tr1)), (_, (str_ref2, tr2))] = node.children.items()
        return is_leaf(tr1) and is_leaf(tr2) and \
            (substr(t.str, str_ref1).find('#') != -1) != \
            (substr(t.str, str_ref2).find('#') != -1)
    return False
The longest common sub-string finding algorithm can also be realized recursively. If the suffix tree T is a leaf, the result is empty; otherwise, we examine all children in T. For those satisfying the matching criteria, the sub-strings are collected as candidates; for those that don't match, we recursively search the common sub-strings among their children. The longest candidate is selected as the final result.

\[
LCS(T) = \begin{cases}
\phi & leaf(T) \\
longest(\{s_i \mid (s_i, T_i) \in C, match(T_i)\} \cup \{s_i \cdot LCS(T_i) \mid (s_i, T_i) \in C, \neg match(T_i), \neg leaf(T_i)\}) & \text{otherwise}
\end{cases} \tag{6.14}
\]

The following Haskell example program implements it.
-- assumes: import Data.List (isInfixOf); maxBy picks the longest candidate
lcs Lf = []
lcs (Br lst) = find $ filter (not . isLeaf . snd) lst where
    find [] = []
    find ((s, t):xs) = maxBy (compare `on` length)
                           (if match t
                            then s : (find xs)
                            else (map (s++) (lcs t)) ++ (find xs))

match (Br [(s1, Lf), (s2, Lf)]) = ("#" `isInfixOf` s1) /= ("#" `isInfixOf` s2)
match _ = False
6.4.4 Find the longest palindrome

A palindrome reads the same from left to right and from right to left. The longest palindrome in a string S can be searched as a longest common sub-string of S and its reverse, by building the generalized suffix tree of the two.

\[ palindrome(S) = LCS(suffixTree(S + reverse(S))) \tag{6.15} \]
6.4.5 Others

The suffix tree can also be used for data compression, such as the Burrows-Wheeler transform, LZW compression (LZSS), etc. [3]

6.5 Notes and short summary
The suffix tree was first introduced by Weiner in 1973 [2]. In 1976, McCreight greatly simplified the construction algorithm; McCreight constructs the suffix tree from right to left. In 1995, Ukkonen gave the first on-line construction algorithm, from left to right. All three algorithms are linear time (O(n)), and some research shows the relationship among these three algorithms [7].
Bibliography

[1] Esko Ukkonen. On-line construction of suffix trees. Algorithmica 14 (3): 249-260. doi:10.1007/BF01206331. [Link]

[2] Weiner, P. Linear pattern matching algorithms, 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1-11, doi:10.1109/SWAT.1973.13

[3] Suffix Tree, Wikipedia. [Link]

[4] Esko Ukkonen. Suffix tree and suffix array techniques for pattern analysis in strings. [Link]

[5] Trie, Wikipedia. [Link]

[6] Suffix Tree (Java). [Link]

[7] Robert Giegerich and Stefan Kurtz. From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Science of Computer Programming 25(2-3):187-218, 1995. [Link]

[8] Robert Giegerich and Stefan Kurtz. A Comparison of Imperative and Purely Functional Suffix Tree Constructions. Algorithmica 19 (3): 331-353. doi:10.1007/PL00009177. [Link]

[9] Bryan O'Sullivan. suffixtree: Efficient, lazy suffix tree implementation. [Link]

[10] Danny. [Link]

[11] Zhang Shaojie. Lecture of Suffix Trees. [Link]
Chapter 7

B-Trees

7.1 Introduction
For a non-empty binary tree $(L, k, R)$, where $L$, $R$, and $k$ are the left child, the right child, and the key, and function $Key(T)$ accesses the key of tree $T$, the ordering constraint can be represented as the following.

\[ \forall x \in L, \forall y \in R \Rightarrow Key(x) \leq k \leq Key(y) \tag{7.1} \]
If we extend this definition to allow multiple keys and children, we get the B-tree definition.

A B-tree is either empty, or contains n keys and n + 1 children, where each child is also a B-tree. We denote these keys and children as $k_1, k_2, ..., k_n$ and $c_1, c_2, ..., c_n, c_{n+1}$.

Figure 7.2 illustrates a B-tree node: the keys and children are interleaved as C[1], K[1], C[2], K[2], ..., C[n], K[n], C[n+1].

The keys and children in a node satisfy the order: for all $x_i \in Keys(c_i)$,

\[ x_1 \leq k_1 \leq x_2 \leq k_2 \leq \dots \leq k_n \leq x_{n+1} \tag{7.2} \]
Finally, after adding some constraints to make the tree balanced, we get the complete B-tree definition.

All leaves have the same depth;
we define an integral number, t, as the minimum degree of the B-tree;
each node can have at most 2t − 1 keys;
each node can have at least t − 1 keys, except the root.

Consider a B-tree holding n keys, with minimum degree t ≥ 2 and height h. All the nodes have at least t − 1 keys except the root, which contains at least 1 key. There are at least 2 nodes at depth 1, at least 2t nodes at depth 2, at least $2t^2$ nodes at depth 3, ..., and finally at least $2t^{h-1}$ nodes at depth h. Multiplying all nodes except the root by t − 1, the total number of keys satisfies the following inequality.
the following inequality.
n 1 + (t 1)(2 + 2t + 2t2 + ... + 2th1 )
h1
X
tk
= 1 + 2(t 1)
k=0
th 1
= 1 + 2(t 1)
t1
= 2th 1
(7.3)
Thus we have the inequality between the height and the number of keys.

\[ h \leq \log_t \frac{n+1}{2} \tag{7.4} \]

This is the reason why the B-tree is balanced. The simplest B-tree is the so-called 2-3-4 tree, where t = 2: every node except the root contains 2, 3, or 4 keys. A red-black tree can essentially be mapped to a 2-3-4 tree.

The following Python code shows an example B-tree definition. It explicitly passes t when creating a node.
class BTree:
    def __init__(self, t):
        self.t = t
        self.keys = []
        self.children = []
B-tree nodes commonly have satellite data as well; we ignore satellite data for illustration purpose.

In this chapter, we will first introduce how to generate a B-tree by insertion. Two different methods will be explained: one is the classic method as in [2], where we split the node before insertion if it's full; the other is the modify-fix approach, which is quite similar to the red-black tree solution [3] [2]. We will then explain how to delete a key from a B-tree and how to look up a key.
7.2 Insertion
A B-tree can be created by inserting keys repeatedly. The basic idea is similar to the binary search tree. When inserting key x, starting from the tree root, we examine all the keys in the node to find a position where all the keys on the left are less than x, and all the keys on the right are greater than x. If the current node is a leaf node and it is not full (there are fewer than 2t − 1 keys in this node), x is inserted at this position. Otherwise, the position points to a child node, and we recursively insert x into it.

Figure 7.3 shows one example with a 2-3-4 tree. When inserting key x = 22: because it is greater than the root key 20, the right child containing keys 26, 38, 45 is examined next; since 22 < 26, the first child containing keys 21 and 25 is examined. This is a leaf node, and it is not full, so key 22 is inserted into this node.

However, if there are already 2t − 1 keys in the leaf, the new key x can't be inserted, because this node is full. Trying to insert key 18 into the above example B-tree meets this problem. There are two methods to solve it.
7.2.1 Splitting
Figure 7.3: Insert key 22 to the 2-3-4 tree. (a) 22 > 20, go to the right child; 22 < 26, go to the first child. (b) Key 22 is inserted into the leaf containing keys 21 and 25.
If a node is full, it contains 2t − 1 keys. We can divide it around the middle into three parts: the first t − 1 keys and the last t − 1 keys form two valid B-tree nodes; the middle part is the t-th key. We can push it up to the parent node (if the current node is the root, then this key, with the two split children, will be the new root).

For node x, denote K(x) as its keys and C(x) as its children, the i-th key as $k_i(x)$, and the j-th child as $c_j(x)$. The below algorithm describes how to split the i-th child of a given node.
procedure Split-Child(node, i)
    x ← ci(node)
    y ← Create-Node
    Insert(K(node), i, kt(x))
    Insert(C(node), i + 1, y)
    K(y) ← {kt+1(x), kt+2(x), ..., k2t−1(x)}
    K(x) ← {k1(x), k2(x), ..., kt−1(x)}
    if y is not leaf then
        C(y) ← {ct+1(x), ct+2(x), ..., c2t(x)}
        C(x) ← {c1(x), c2(x), ..., ct(x)}
The following example Python program implements this child splitting algorithm.
def split_child(node, i):
    t = node.t
    x = node.children[i]
    y = BTree(t)
    node.keys.insert(i, x.keys[t-1])
    node.children.insert(i+1, y)
    y.keys = x.keys[t:]
    x.keys = x.keys[:t-1]
    if not is_leaf(x):
        y.children = x.children[t:]
        x.children = x.children[:t]
Figure 7.4: Split the i-th child: keys K[1..t−1] stay in the original node, keys K[t+1..2t−1] move to the new node, and the t-th key K[t] is pushed up to the parent.
After splitting, a key can be inserted into a node which is not full. If it is a leaf node, the new key is inserted at the proper position based on the order; otherwise, the algorithm finds a proper child node into which the new key will be inserted. If this child is full, splitting is performed first.
function Insert-Nonfull(T, k)
    if T is leaf then
        i ← 1
        while i ≤ |K(T)| ∧ k > ki(T) do
            i ← i + 1
        Insert(K(T), i, k)
    else
        i ← |K(T)|
        while i ≥ 1 ∧ k < ki(T) do
            i ← i − 1
        i ← i + 1
        if ci(T) is full then
            Split-Child(T, i)
            if k > ki(T) then
                i ← i + 1
        Insert-Nonfull(ci(T), k)
    return T
This algorithm is recursive. In a B-tree, the minimum degree t is typically chosen relative to the magnetic disk structure: even a small depth can support a huge amount of data (with t = 10, up to 10 billion items can be stored in a B-tree of height 10). The recursion can also be eliminated; this is left as an exercise to the reader.

Figure 7.5 shows the results of continuously inserting the keys G, M, P, X, A, C, D, E, J, K, N, O, R, S, T, U, V, Y, Z into an empty tree. The first result is the 2-3-4 tree (t = 2); the second result shows how it varies when t = 3.

Figure 7.5: Insert the keys into an empty B-tree: (a) t = 2; (b) t = 3.
The main insertion program creates a new root and splits the old one first when it is full.

def insert(tr, key):
    root = tr
    if is_full(root):
        s = BTree(root.t)
        s.children.insert(0, root)
        split_child(s, 0)
        root = s
    return insert_nonfull(root, key)
For an array-based collection, appending at the tail is much more effective than inserting at another position, because the latter takes O(n) time if the length of the collection is n. The ordered insert program first appends the new element at the end of the existing collection, then iterates from the last element to the first one, checking if the two adjacent elements are ordered; if not, these two elements are swapped.
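A Python sketch of Insert-Nonfull together with this ordered insert; is_leaf and is_full are the small helpers assumed by the main program above:

def insert_nonfull(tr, key):
    if is_leaf(tr):
        ordered_insert(tr.keys, key)
    else:
        i = len(tr.keys)
        while i > 0 and key < tr.keys[i-1]:
            i = i - 1
        if is_full(tr.children[i]):
            split_child(tr, i)
            if key > tr.keys[i]:    # compare with the pushed-up key
                i = i + 1
        insert_nonfull(tr.children[i], key)
    return tr

def ordered_insert(lst, x):
    lst.append(x)                   # append at the tail, then swap down
    for i in range(len(lst) - 1, 0, -1):
        if lst[i-1] <= lst[i]:
            break
        (lst[i-1], lst[i]) = (lst[i], lst[i-1])

def is_leaf(tr):
    return tr.children == []

def is_full(tr):
    return len(tr.keys) >= 2 * tr.t - 1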
7.2.2 Insert then fixing

In functional settings, B-tree insertion can be realized in a way similar to the red-black tree. When inserting a key into a red-black tree, it is first inserted as in a normal binary search tree, then recursive fixing is performed to resume the balance of the tree. A B-tree can be viewed as an extension of the binary search tree, where each node contains multiple keys and children. We can first insert the key without considering whether the node is full, then perform fixing to satisfy the minimum degree constraint.
\[ insert(T, k) = fix(ins(T, k)) \tag{7.5} \]
Function ins(T, k) traverses the B-tree T from the root to find a proper position where key k can be inserted. After that, function fix is applied to resume the B-tree properties. Denote the B-tree in the form T = (K, C, t), where K represents the keys, C represents the children, and t is the minimum degree.

Below is the Haskell definition of the B-tree.
Below is the Haskell definition of B-tree.
data BTree a = Node{ keys :: [a]
, children :: [BTree a]
, degree :: Int} deriving (Eq)
There are two cases when realizing the ins(T, k) function. If the tree T is a leaf, k is inserted into the keys; otherwise, if T is a branch node, we need to recursively insert k into the proper child.

Figure 7.6 shows the branch case. The algorithm first locates the position: for a certain key $k_i$, if the new key k to be inserted satisfies $k_{i-1} < k < k_i$, then we need to recursively insert k into child $c_i$. This position divides the node into three parts: the left part, the child $c_i$, and the right part.
k, K[i-1]<