AlgoXY Elementary Algorithms
AlgoXY Elementary Algorithms
. If the length of A
. Otherwise, it
means the minimum free number is located in A
, m+ 1, u) : |A
| = ml + 1
search(A
, l, m) : otherwise
where
m =
l+u
2
A
= {x A x m}
A
= {x A x > m}
It is obvious that this algorithm doesnt need any extra spaces
2
. In each call
it performs O(|A|) comparison to build A
and A
and A
= {x
2
, x
3
, ...} and Y
= {y
2
, y
3
, ...}. We have
X Y =
_
_
X : Y =
Y : X =
{x
1
, X
Y } : x
1
< y
1
{x
1
, X
} : x
1
= y
1
{y
1
, X Y
} : x
1
> y
1
In a functional programming language such as Haskell, which supports lazy
evaluation, The above innity series functions can be translate into the following
program.
18 Preface
ns = 1:merge (map (2) ns) (merge (map (3) ns) (map (5) ns))
merge [] l = l
merge l [] = l
merge (x:xs) (y:ys) | x <y = x : merge xs (y:ys)
| x ==y = x : merge xs ys
| otherwise = y : merge (x:xs) ys
By evaluate ns !! (n-1), we can get the 1500th number as below.
>ns !! (1500-1)
859963392
0.3.3 Improvement 2
Considering the above solution, although it is much faster than the brute-force
one, It still has some drawbacks. It produces many duplicated numbers and
they are nally dropped when examine the queue. Secondly, it does linear scan
and insertion to keep the order of all elements in the queue, which degrade the
ENQUEUE operation from O(1) to O(|Q|).
If we use three queues instead of using only one, we can improve the solution
one step ahead. Denote these queues as Q
2
, Q
3
, and Q
5
, and we initialize them
as Q
2
= {2}, Q
3
= {3} and Q
5
= {5}. Each time we DEQUEUEed the smallest
one from Q
2
, Q
3
, and Q
5
as x. And do the following test:
If x comes from Q
2
, we ENQUEUE 2x, 3x, and 5x back to Q
2
, Q
3
, and
Q
5
respectively;
If x comes from Q
3
, we only need ENQUEUE 3x to Q
3
, and 5x to Q
5
;
We neednt ENQUEUE 2x to Q
2
, because 2x have already existed in Q
3
;
If x comes from Q
5
, we only need ENQUEUE 5x to Q
5
; there is no need
to ENQUEUE 3x, 5x to Q
3
, Q
5
because they have already been in the
queues;
We repeatedly ENQUEUE the smallest one until we nd the n-th element.
The algorithm based on this idea is implemented as below.
1: function Get-Number(n)
2: if n = 1 then
3: return 1
4: else
5: Q
2
{2}
6: Q
3
{3}
7: Q
5
{5}
8: while n > 1 do
9: x min(Head(Q
2
), Head(Q
3
), Head(Q
5
))
10: if x = Head(Q
2
) then
11: Dequeue(Q
2
)
12: Enqueue(Q
2
, 2x)
13: Enqueue(Q
3
, 3x)
14: Enqueue(Q
5
, 5x)
15: else if x = Head(Q
3
) then
0.3. THE NUMBER PUZZLE, POWER OF DATA STRUCTURE 19
2
min=2
3 5
2*min=4 3*min=6 5*min=10
4
min=3
3 6 5 10
3*min=9 5*min=15
4
min=4
6 9 5 10 15
2*min=8 3*min=12 5*min=20
8
min=5
6 9 12 5 10 15 20
5*min=25
Figure 4: First 4 steps of constructing numbers with Q
2
, Q
3
, and Q
5
.
1. Queues are initialized with 2, 3, 5 as the only element;
2. New elements 4, 6, and 10 are pushed back;
3. New elements 9, and 15, are pushed back;
4. New elements 8, 12, and 20 are pushed back;
5. New element 25 is pushed back.
16: Dequeue(Q
3
)
17: Enqueue(Q
3
, 3x)
18: Enqueue(Q
5
, 5x)
19: else
20: Dequeue(Q
5
)
21: Enqueue(Q
5
, 5x)
22: n n 1
23: return x
This algorithm loops n times, and within each loop, it extract one head
element from the three queues, which takes constant time. Then it appends
one to three new elements at the end of queues which bounds to constant time
too. So the total time of the algorithm bounds to O(n). The C++ program
translated from this algorithm shown below takes less than 1 s to produce the
1500th number, 859963392.
typedef unsigned long Integer;
Integer get_number(int n){
if(n==1)
return 1;
queue<Integer> Q2, Q3, Q5;
Q2.push(2);
Q3.push(3);
Q5.push(5);
Integer x;
20 Preface
while(n-- > 1){
x = min(min(Q2.front(), Q3.front()), Q5.front());
if(x==Q2.front()){
Q2.pop();
Q2.push(x2);
Q3.push(x3);
Q5.push(x5);
}
else if(x==Q3.front()){
Q3.pop();
Q3.push(x3);
Q5.push(x5);
}
else{
Q5.pop();
Q5.push(x5);
}
}
return x;
}
This solution can be also implemented in Functional way. We dene a func-
tion take(n), which will return the rst n numbers contains only factor 2, 3, or
5.
take(n) = f(n, {1}, {2}, {3}, {5})
Where
f(n, X, Q
2
, Q
3
, Q
5
) =
_
X : n = 1
f(n 1, X {x}, Q
2
, Q
3
, Q
5
) : otherwise
x = min(Q
21
, Q
31
, Q
51
)
Q
2
, Q
3
, Q
5
=
_
_
_
{Q
22
, Q
23
, ...} {2x}, Q
3
{3x}, Q
5
{5x} : x = Q
21
Q
2
, {Q
32
, Q
33
, ...} {3x}, Q5 {5x} : x = Q
31
Q
2
, Q
3
, {Q
52
, Q
53
, ...} {5x} : x = Q
51
And these functional denition can be realized in Haskell as the following.
ks 1 xs _ = xs
ks n xs (q2, q3, q5) = ks (n-1) (xs++[x]) update
where
x = minimum $ map head [q2, q3, q5]
update | x == head q2 = ((tail q2)++[x2], q3++[x3], q5++[x5])
| x == head q3 = (q2, (tail q3)++[x3], q5++[x5])
| otherwise = (q2, q3, (tail q5)++[x5])
takeN n = ks n [1] ([2], [3], [5])
Invoke last takeN 1500 will generate the correct answer 859963392.
0.4. NOTES AND SHORT SUMMARY 21
0.4 Notes and short summary
If review the 2 puzzles, we found in both cases, the brute-force solutions are so
weak. In the rst problem, its quite poor in dealing with long ID list, while in
the second problem, it doesnt work at all.
The rst problem shows the power of algorithms, while the second problem
tells why data structure is important. There are plenty of interesting problems,
which are hard to solve before computer was invented. With the aid of com-
puter and programming, we are able to nd the answer in a quite dierent way.
Compare to what we learned in mathematics course in school, we havent been
taught the method like this.
While there have been already a lot of wonderful books about algorithms,
data structures and math, however, few of them provide the comparison between
the procedural solution and the functional solution. From the above discussion,
it can be found that functional solution sometimes is very expressive and they
are close to what we are familiar in mathematics.
This series of post focus on providing both imperative and functional algo-
rithms and data structures. Many functional data structures can be referenced
from Okasakis book[6]. While the imperative ones can be founded in classic
text books [2] or even in WIKIpedia. Multiple programming languages, includ-
ing, C, C++, Python, Haskell, and Scheme/Lisp will be used. In order to make
it easy to read by programmers with dierent background, pseudo code and
mathematical function are the regular descriptions of each post.
The author is NOT a native English speaker, the reason why this book is
only available in English for the time being is because the contents are still
changing frequently. Any feedback, comments, or criticizes are welcome.
0.5 Structure of the contents
In the following series of post, Ill rst introduce about elementary data struc-
tures before algorithms, because many algorithms need knowledge of data struc-
tures as prerequisite.
The hello world data structure, binary search tree is the rst topic; Then
we introduce how to solve the balance problem of binary search tree. After
that, Ill show other interesting trees. Trie, Patricia, sux trees are useful in
text manipulation. While B-trees are commonly used in le system and data
base implementation.
The second part of data structures is about heaps. Well provide a gen-
eral Heap denition and introduce about binary heaps by array and by explicit
binary trees. Then well extend to K-ary heaps including Binomial heaps, Fi-
bonacci heaps, and pairing heaps.
Array and queues are considered among the easiest data structures typically,
However, well show how dicult to implement them in the third part.
As the elementary sort algorithms, well introduce insertion sort, quick sort,
merge sort etc in both imperative way and functional way.
The nal part is about searching, besides the element searching, well also
show string matching algorithms such as KMP.
All the posts are provided under GNU FDL (Free document license), and
programs are under GNU GPL.
22 Preface
0.6 Appendix
All programs provided along with this article are free for downloading. download
position: http://sites.google.com/site/algoxy/introduction
Bibliography
[1] Richard Bird. Pearls of functional algorithm design. Cambridge Univer-
sity Press; 1 edition (November 1, 2010). ISBN-10: 0521513383
[2] Jon Bentley. Programming Pearls(2nd Edition). Addison-Wesley Profes-
sional; 2 edition (October 7, 1999). ISBN-13: 978-0201657883
[3] Chris Okasaki. Purely Functional Data Structures. Cambridge university
press, (July 1, 1999), ISBN-13: 978-0521663502
[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Cliord
Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.
ISBN: 0262032937.
23
24 BIBLIOGRAPHY
Part I
Trees
25
AlgoXY 27
Binary search tree, the hello world data structure
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
28 Binary search tree
Chapter 1
Binary search tree, the
hello world data structure
1.1 Introduction
Its typically considered that Arrays or Lists are the hello world data struc-
ture. However, well see they are not so easy to implement actually. In some
procedural settings, Arrays are the elementary representation, and it is possible
to realize linked list by array (section 10.3 in [2]); While in some functional
settings, Linked list are the elementary bricks to build arrays and other data
structures.
Considering these factors, we start with Binary Search Tree (or BST) as the
hello world data structure. Jon Bentley mentioned an interesting problem in
programming pearls [2]. The problem is about to count the number of times
each word occurs in a big text. And the solution is something like the below
C++ code.
int main(int, char ){
map<string, int> dict;
string s;
while(cin>>s)
++dict[s];
map<string, int>::iterator it=dict.begin();
for(; it!=dict.end(); ++it)
cout<<itfirst<<":"<<itsecond<<"n";
}
And we can run it to produce the word counting result as the following
1
.
$ g++ wordcount.cpp -o wordcount
$ cat bbe.txt | ./wordcount > wc.txt
The map provided in standard template library is a kind of balanced binary
search tree with augmented data. Here we use the word in the text as the key
and the number of occurrence as the augmented data. This program is fast, and
1
This is not UNIX unique command, in Windows OS, it can be achieved by:
type bbe.txt|wordcount.exe > wc.txt
29
30CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
it reects the power of binary search tree. Well introduce how to implement
BST in this post and show how to solve the balancing problem in later post.
Before we dive into binary search tree. Lets rst introduce about the more
general binary tree.
The concept of Binary tree is a recursive denition. Binary search tree is
just a special type of binary tree. The Binary tree is typically dened as the
following.
A binary tree is
either an empty node;
or a node contains 3 parts, a value, a left child which is a binary tree and
a right child which is also a binary tree.
Figure 1.1 shows this concept and an example binary tree.
k
L R
(a) Concept of binary tree
16
4 10
14 7
2 8 1
9 3
(b) An example binary tree
Figure 1.1: Binary tree concept and an example.
A binary search tree is a binary tree which satises the below criteria. for
each node in binary search tree,
all the values in left child tree are less than the value of of this node;
the value of this node is less than any values in its right child tree.
Figure 1.2 shows an example of binary search tree. Compare with Figure
1.1 we can see the dierence about the key ordering between them.
1.2. DATA LAYOUT 31
4
3 8
1
2
7 16
10
9 14
Figure 1.2: A Binary search tree example.
1.2 Data Layout
Based on the recursive denition of binary search tree, we can draw the data
layout in procedural setting with pointer supported as in gure 1.3.
The node contains a eld of key, which can be augmented with satellite data;
a eld contains a pointer to the left child and a eld point to the right child. In
order to back-track an ancestor easily, a parent eld can be provided as well.
In this post, well ignore the satellite data for simple illustration purpose.
Based on this layout, the node of binary search tree can be dened in a proce-
dural language, such as C++ as the following.
template<class T>
struct node{
node(T x):key(x), left(0), right(0), parent(0){}
~node(){
delete left;
delete right;
}
node left;
node right;
node parent; //parent is optional, itshelpfulforsucc/pred
Tkey;
};
There is another setting, for instance in Scheme/Lisp languages, the ele-
mentary data structure is linked-list. Figure 1.4 shows how a binary search tree
node can be built on top of linked-list.
Because in pure functional setting, Its hard to use pointer for back tracking
the ancestors, (and typically, there is no need to do back tracking, since we can
32CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
key + sitellite data
left
right
parent
key + sitellite data
left
right
parent
key + sitellite data
left
right
parent
... ... ... ...
Figure 1.3: Layout of nodes with parent eld.
key next
left ... next
right ... NIL
Figure 1.4: Binary search tree node layout on top of linked list. Where left...
and right ... are either empty or binary search tree node composed in the same
way.
1.3. INSERTION 33
provide top-down solution in recursive way) there is not parent eld in such
layout.
For simplied reason, well skip the detailed layout in the future, and only
focus on the logic layout of data structures. For example, below is the denition
of binary search tree node in Haskell.
data Tree a = Empty
| Node (Tree a) a (Tree a)
1.3 Insertion
To insert a key k (may be along with a value in practice) to a binary search tree
T, we can follow a quite straight forward way.
If the tree is empty, then construct a leave node with key=k;
If k is less than the key of root node, insert it to the left child;
If k is greater than the key of root, insert it to the right child;
There is an exceptional case that if k is equal to the key of root, it means it
has already existed, we can either overwrite the data, or just do nothing. For
simple reason, this case is skipped in this post.
This algorithm is described recursively. It is so simple that is why we consider
binary search tree is hello world data structure. Formally, the algorithm can
be represented with a recursive function.
insert(T, k) =
_
_
_
node(, k, ) : T =
node(insert(L, k), Key, R) : k < Key
node(L, Key, insert(R, k)) : otherwise
(1.1)
Where
L = left(T)
R = right(T)
Key = key(T)
The node function creates a new node with given left sub-tree, a key and a
right sub-tree as parameters. means NIL or Empty. function left, right and
key are access functions which can get the left sub-tree, right sub-tree and the
key of a node.
Translate the above functions directly to Haskell yields the following pro-
gram.
insert::(Ord a) Tree a a Tree a
insert Empty k = Node Empty k Empty
insert (Node l x r) k | k < x = Node (insert l k) x r
| otherwise = Node l x (insert r k)
This program utilized the pattern matching features provided by the lan-
guage. However, even in functional settings without this feature, for instance,
Scheme/Lisp, the program is still expressive.
34CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
(define (insert tree x)
(cond ((null? tree) (list () x ()))
((< x (key tree))
(make-tree (insert (left tree) x)
(key tree)
(right tree)))
((> x (key tree))
(make-tree (left tree)
(key tree)
(insert (right tree) x)))))
It is possible to turn the algorithm completely into imperative way without
recursion.
1: function Insert(T, k)
2: root T
3: x Create-Leaf(k)
4: parent NIL
5: while T = NIL do
6: parent T
7: if k < Key(T) then
8: T Left(T)
9: else
10: T Right(T)
11: Parent(x) parent
12: if parent = NIL then tree T is empty
13: return x
14: else if k < Key(parent) then
15: Left(parent) x
16: else
17: Right(parent) x
18: return root
19: function Create-Leaf(k)
20: x Empty-Node
21: Key(x) k
22: Left(x) NIL
23: Right(x) NIL
24: Parent(x) NIL
25: return x
Compare with the functional algorithm, it is obviously that this one is more
complex although it is fast and can handle very deep tree. A complete C++
program and a python program are available along with this post for reference.
1.4 Traversing
Traversing means visiting every element one by one in a binary search tree.
There are 3 ways to traverse a binary tree, pre-order tree walk, in-order tree
walk, and post-order tree walk. The names of these traversing methods highlight
the order of when we visit the root of a binary search tree.
1.4. TRAVERSING 35
Since there are three parts in a tree, as left child, the root, which con-
tains the key and satellite data, and the right child. If we denote them as
(left, current, right), the three traversing methods are dened as the following.
pre-order traverse, visit current, then left, nally right;
in-order traverse, visit left , then current, nally right;
post-order traverse, visit left, then right, nally current.
Note that each visiting operation is recursive. And we see the order of
visiting current determines the name of the traversing method.
For the binary search tree shown in gure 1.2, below are the three dierent
traversing results.
pre-order traverse result: 4, 3, 1, 2, 8, 7, 16, 10, 9, 14;
in-order traverse result: 1, 2, 3, 4, 7, 8, 9, 10, 14, 16;
post-order traverse result: 2, 1, 3, 7, 9, 14, 10, 16, 8, 4;
It can be found that the in-order walk of a binary search tree outputs the
elements in increase order, which is particularly helpful. The denition of binary
search tree ensures this interesting property, while the proof of this fact is left
as an exercise of this post.
In-order tree walk algorithm can be described as the following:
If the tree is empty, just return;
traverse the left child by in-order walk, then access the key, nally traverse
the right child by in-order walk.
Translate the above description yields a generic map function
map(f, T) =
_
: T =
node(l
, k
, r
) : otherwise
(1.2)
where
l
= map(f, left(T))
r
= map(f, right(T))
k
= f(key(T))
If we only need access the key without create the transformed tree, we can
realize this algorithm in procedural way lie the below C++ program.
template<class T, class F>
void in_order_walk(node<T> t, F f){
if(t){
in_order_walk(tleft, f);
f(tvalue);
in_order_walk(tright, f);
}
}
36CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
The function takes a parameter f, it can be a real function, or a function
object, this program will apply f to the node by in-order tree walk.
We can simplied this algorithm one more step to dene a function which
turns a binary search tree to a sorted list by in-order traversing.
toList(T) =
_
: T =
toList(left(T)) {key(T)} toList(right(T)) : otherwise
(1.3)
Below is the Haskell program based on this denition.
toList::(Ord a)Tree a [a]
toList Empty = []
toList (Node l x r) = toList l ++ [x] ++ toList r
This provides us a method to sort a list of elements. We can rst build
a binary search tree from the list, then output the tree by in-order traversing.
This method is called as tree sort. Lets denote the list X = {x
1
, x
2
, x
3
, ..., x
n
}.
sort(X) = toList(fromList(X)) (1.4)
And we can write it in function composition form.
sort = toList.fromList
Where function fromList repeatedly insert every element to a binary search
tree.
fromList(X) = foldL(insert, , X) (1.5)
It can also be written in partial application form like below.
fromList = foldL insert
For the readers who are not familiar with folding from left, this function can
also be dened recursively as the following.
fromList(X) =
_
: X =
insert(fromList({x
2
, x
3
, ..., x
n
}), x
1
) : otherwise
Well intense use folding function as well as the function composition and
partial evaluation in the future, please refer to appendix of this book or [6] [7]
and [8] for more information.
Exercise 1.1
Given the in-order traverse result and pre-order traverse result, can you re-
construct the tree from these result and gure out the post-order traversing
result?
Pre-order result: 1, 2, 4, 3, 5, 6; In-order result: 4, 2, 1, 5, 3, 6; Post-order
result: ?
Write a program in your favorite language to re-construct the binary tree
from pre-order result and in-order result.
1.5. QUERYING A BINARY SEARCH TREE 37
Prove why in-order walk output the elements stored in a binary search
tree in increase order?
Can you analyze the performance of tree sort with big-O notation?
1.5 Querying a binary search tree
There are three types of querying for binary search tree, searching a key in the
tree, nd the minimum or maximum element in the tree, and nd the predecessor
or successor of an element in the tree.
1.5.1 Looking up
According to the denition of binary search tree, search a key in a tree can be
realized as the following.
If the tree is empty, the searching fails;
If the key of the root is equal to the value to be found, the search succeed.
The root is returned as the result;
If the value is less than the key of the root, search in the left child.
Else, which means that the value is greater than the key of the root, search
in the right child.
This algorithm can be described with a recursive function as below.
lookup(T, x) =
_
_
: T =
T : key(T) = x
lookup(left(T), x) : x < key(T)
lookup(right(T), x) : otherwise
(1.6)
In the real application, we may return the satellite data instead of the node
as the search result. This algorithm is simple and straightforward. Here is a
translation of Haskell program.
lookup::(Ord a) Tree a a Tree a
lookup Empty _ = Empty
lookup t@(Node l k r) x | k == x = t
| x < k = lookup l x
| otherwise = lookup r x
If the binary search tree is well balanced, which means that almost all nodes
have both non-NIL left child and right child, for N elements, the search algo-
rithm takes O(lg N) time to perform. This is not formal denition of balance.
Well show it in later post about red-black-tree. If the tree is poor balanced,
the worst case takes O(N) time to search for a key. If we denote the height of
the tree as h, we can uniform the performance of the algorithm as O(h).
The search algorithm can also be realized without using recursion in a pro-
cedural manner.
1: function Search(T, x)
38CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
2: while T = NIL Key(T) = x do
3: if x < Key(T) then
4: T Left(T)
5: else
6: T Right(T)
7: return T
Below is the C++ program based on this algorithm.
template<class T>
node<T> search(node<T> t, T x){
while(t && tkey!=x){
if(x < tkey) t=tleft;
else t=tright;
}
return t;
}
1.5.2 Minimum and maximum
Minimum and maximum can be implemented from the property of binary search
tree, less keys are always in left child, and greater keys are in right.
For minimum, we can continue traverse the left sub tree until it is empty.
While for maximum, we traverse the right.
min(T) =
_
key(T) : left(T) =
min(left(T)) : otherwise
(1.7)
max(T) =
_
key(T) : right(T) =
max(right(T)) : otherwise
(1.8)
Both function bound to O(h) time, where h is the height of the tree. For the
balanced binary search tree, min/max are bound to O(lg N) time, while they
are O(N) in the worst cases.
We skip translating them to programs, Its also possible to implement them
in pure procedural way without using recursion.
1.5.3 Successor and predecessor
The last kind of querying, to nd the successor or predecessor of an element
is useful when a tree is treated as a generic container and traversed by using
iterator. It will be relative easier to implement if parent of a node can be
accessed directly.
It seems that the functional solution is hard to be found, because there is no
pointer like eld linking to the parent node. One solution is to left breadcrumbs
when we visit the tree, and use these information to back-track or even re-
construct the whole tree. Such data structure, that contains both the tree and
breadcrumbs is called zipper. please refer to [9] for details.
However, If we consider the original purpose of providing succ/pred func-
tion, to traverse all the binary search tree elements one by one as a generic
container, we realize that they dont make signicant sense in functional settings
1.5. QUERYING A BINARY SEARCH TREE 39
because we can traverse the tree in increase order by mapT function we dened
previously.
Well meet many problems in this series of post that they are only valid in
imperative settings, and they are not meaningful problems in functional settings
at all. One good example is how to delete an element in red-black-tree[3].
In this section, well only present the imperative algorithm for nding the
successor and predecessor in a binary search tree.
When nding the successor of element x, which is the smallest one y that
satises y > x, there are two cases. If the node with value x has non-NIL right
child, the minimum element in right child is the answer; For example, in Figure
1.2, in order to nd the successor of 8, we search its right sub tree for the
minimum one, which yields 9 as the result. While if node x dont have right
child, we need back-track to nd the closest ancestors whose left child is also
ancestor of x. In Figure 1.2, since 2 dont have right sub tree, we go back to its
parent 1. However, node 1 dont have left child, so we go back again and reach
to node 3, the left child of 3, is also ancestor of 2, thus, 3 is the successor of
node 2.
Based on this description, the algorithm can be given as the following.
1: function Succ(x)
2: if Right(x) = NIL then
3: return Min(Right(x))
4: else
5: p Parent(x)
6: while p = NIL and x = Right(p) do
7: x p
8: p Parent(p)
9: return p
The predecessor case is quite similar to the successor algorithm, they are
symmetrical to each other.
1: function Pred(x)
2: if Left(x) = NIL then
3: return Max(Left(x))
4: else
5: p Parent(x)
6: while p = NIL and x = Left(p) do
7: x p
8: p Parent(p)
9: return p
Below are the Python programs based on these algorithms. They are changed
a bit in while loop conditions.
def succ(x):
if x.right is not None: return tree_min(x.right)
p = x.parent
while p is not None and p.left != x:
x = p
p = p.parent
return p
def pred(x):
40CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
if x.left is not None: return tree_max(x.left)
p = x.parent
while p is not None and p.right != x:
x = p
p = p.parent
return p
Exercise 1.2
Can you gure out how to iterate a tree as a generic container by using
pred()/succ()? Whats the performance of such traversing process in terms
of big-O?
A reader discussed about traversing all elements inside a range [a, b]. In
C++, the algorithm looks like the below code:
for each(m.lower bound(12), m.upper bound(26), f);
Can you provide the purely function solution for this problem?
1.6 Deletion
Deletion is another imperative only topic for binary search tree. This is because
deletion mutate the tree, while in purely functional settings, we dont modify
the tree after building it in most application.
However, One method of deleting element from binary search tree in purely
functional way is shown in this section. Its actually reconstructing the tree but
not modifying the tree.
Deletion is the most complex operation for binary search tree. this is because
we must keep the BST property, that for any node, all keys in left sub tree are
less than the key of this node, and they are all less than any keys in right sub
tree. Deleting a node can break this property.
In this post, dierent with the algorithm described in [2], A simpler one from
SGI STL implementation is used.[6]
To delete a node x from a tree.
If x has no child or only one child, splice x out;
Otherwise (x has two children), use minimum element of its right sub tree
to replace x, and splice the original minimum element out.
The simplicity comes from the truth that, the minimum element is stored in
a node in the right sub tree, which cant have two non-NIL children. It ends up
in the trivial case, the the node can be directly splice out from the tree.
Figure 1.5, 1.6, and 1.7 illustrate these dierent cases when deleting a node
from the tree.
1.6. DELETION 41
Tree
x
NIL NIL
Figure 1.5: x can be spliced out.
Tree
x
L NIL
(a) Before delete x
Tree
L
(b) After delete x
x is spliced out, and replaced by its left child.
Tree
x
NIL R
(c) Before delete x
Tree
R
(d) Before delete x
x is spliced out, and replaced by its right child.
Figure 1.6: Delete a node which has only one non-NIL child.
42CHAPTER 1. BINARY SEARCH TREE, THE HELLO WORLD DATA STRUCTURE
Tree
x
L R
(a) Before delete x
Tree
min(R)
L delete(R, min(R))
(b) After delete x
x is replaced by splicing the minimum element from its right child.
Figure 1.7: Delete a node which has both children.
Based on this idea, the deletion can be dened as the below function.
delete(T, x) =
_
_
: T =
node(delete(L, x), K, R) : x < K
node(L, K, delete(R, x)) : x > K
R : x = K L =
L : x = K R =
node(L, y, delete(R, y)) : otherwise
(1.9)
Where
L = left(T)
R = right(T)
K = key(T)
y = min(R)
Translating the function to Haskell yields the below program.
delete::(Ord a) Tree a a Tree a
delete Empty _ = Empty
delete (Node l k r) x | x < k = (Node (delete l x) k r)
| x > k = (Node l k (delete r x))
-- x == k
| isEmpty l = r
| isEmpty r = l
| otherwise = (Node l k (delete r k))
where k = min r
Function isEmpty is used to test if a tree is empty (). Note that the
algorithm rst performs search to locate the node where the element need be
deleted, after that it execute the deletion. This algorithm takes O(h) time where
h is the height of the tree.
1.6. DELETION 43
Its also possible to pass the node but not the element to the algorithm for
deletion. Thus the searching is no more needed.
The imperative algorithm is more complex because it need set the parent
properly. The function will return the root of the result tree.
1: function Delete(T, x)
2: root T
3: x
x save x
4: parent Parent(x)
5: if Left(x) = NIL then
6: x Right(x)
7: else if Right(x) = NIL then
8: x Left(x)
9: else both children are non-NIL
10: y Min(Right(x))
11: Key(x) Key(y)
12: Copy other satellite data from y to x
13: if Parent(y) = x then y hasnt left sub tree
14: Left(Parent(y)) Right(y)
15: else y is the root of right child of x
16: Right(x) Right(y)
17: Remove y
18: return root
19: if x = NIL then
20: Parent(x) parent
21: if parent = NIL then We are removing the root of the tree
22: root x
23: else
24: if Left(parent) = x
then
25: Left(parent) x
26: else
27: Right(parent) x
28: Remove x
1
, A
2
, ..., A
i1
}
At any time, when we process the i-th element, all elements before i have
already been sorted. we continuously insert the current elements until consume
all the unsorted data. This idea is illustrated as in gure 9.3.
... sorted elements ... x
insert
... unsorted elements ...
Figure 2.2: The left part is sorted data, continuously insert elements to sorted
part.
We can nd there is recursive concept in this denition. Thus it can be
expressed as the following.
sort(A) =
_
: A =
insert(sort({A
2
, A
3
, ...}), A
1
) : otherwise
(2.2)
2.2 Insertion
We havent answered the question about how to realize insertion however. Its
a puzzle how does human locate the proper position so quickly.
For computer, its an obvious option to perform a scan. We can either scan
from left to right or vice versa. However, if the sequence is stored in plain array,
its necessary to scan from right to left.
function Sort(A)
for i 2 to Length(A) do Insert A[i] to sorted sequence A[1...i 1]
2.2. INSERTION 51
x A[i]
j i 1
while j > 0 x < A[j] do
A[j + 1] A[j]
j j 1
A[j + 1] x
One may think scan from left to right is natural. However, it isnt as eect
as above algorithm for plain array. The reason is that, its expensive to insert an
element in arbitrary position in an array. As array stores elements continuously.
If we want to insert new element x in position i, we must shift all elements after
i, including i +1, i +2, ... one position to right. After that the cell at position i
is empty, and we can put x in it. This is illustrated in gure 2.3.
A[1] A[2] ... A[i-1] A[i] A[i+1] A[i+2] ... A[n-1] A[n] empty
x
insert
Figure 2.3: Insert x to array A at position i.
If the length of array is N, this indicates we need examine the rst i elements,
then perform N i + 1 moves, and then insert x to the i-th cell. So insertion
from left to right need traverse the whole array anyway. While if we scan from
right to left, we totally examine the last j = N i + 1 elements, and perform
the same amount of moves. If j is small (e.g. less than N/2), there is possibility
to perform less operations than scan from left to right.
Translate the above algorithm to Python yields the following code.
def isort(xs):
n = len(xs)
for i in range(1, n):
x = xs[i]
j = i - 1
while j 0 and x < xs[j]:
xs[j+1] = xs[j]
j = j - 1
xs[j+1] = x
It can be found some other equivalent programs, for instance the following
ANSI C program. However this version isnt as eective as the pseudo code.
void isort(Key xs, int n){
int i, j;
for(i=1; i<n; ++i)
for(j=i-1; j0 && xs[j+1] < xs[j]; --j)
swap(xs, j, j+1);
}
52 CHAPTER 2. THE EVOLUTION OF INSERTION SORT
This is because the swapping function, which can exchange two elements
typically uses a temporary variable like the following:
void swap(Key xs, int i, int j){
Key temp = xs[i];
xs[i] = xs[j];
xs[j] = temp;
}
So the ANSI C program presented above takes 3M times assignment, where
M is the number of inner loops. While the pseudo code as well as the Python
program use shift operation instead of swapping. There are N +2 times assign-
ment.
We can also provide Insert() function explicitly, and call it from the general
insertion sort algorithm in previous section. We skip the detailed realization here
and left it as an exercise.
All the insertion algorithms are bound to O(N), where N is the length of
the sequence. No matter what dierence among them, such as scan from left
or from right. Thus the over all performance for insertion sort is quadratic as
O(N
2
).
Exercise 2.1
Provide explicit insertion function, and call it with general insertion sort
algorithm. Please realize it in both procedural way and functional way.
2.3 Improvement 1
Lets go back to the question, that why human being can nd the proper position
for insertion so quickly. We have shown a solution based on scan. Note the fact
that at any time, all cards at hands have been well sorted, another possible
solution is to use binary search to nd that location.
Well explain the search algorithms in other dedicated chapter. Binary search
is just briey introduced for illustration purpose here.
The algorithm will be changed to call a binary search procedure.
function Sort(A)
for i 2 to Length(A) do
x A[i]
p Binary-Search(A[1...i 1], x)
for j i down to p do
A[j] A[j 1]
A[p] x
Instead of scan elements one by one, binary search utilize the information
that all elements in slice of array {A
1
, ..., A
i1
} are sorted. Lets assume the
order is monotonic increase order. To nd a position j that satises A
j1
x A
j
. We can rst examine the middle element, for example, A
i/2
. If x is
less than it, we need next recursively perform binary search in the rst half of
the sequence; otherwise, we only need search in last half.
Every time, we halve the elements to be examined, this search process runs
O(lg N) time to locate the insertion position.
2.4. IMPROVEMENT 2 53
function Binary-Search(A, x)
l 1
u 1+ Length(A)
while l < u do
m
l+u
2
if A
m
= x then
return m Find a duplicated element
else if A
m
< x then
l m+ 1
else
u m
return l
The improved insertion sort algorithm is still bound to O(N
2
), compare to
previous section, which we use O(N
2
) times comparison and O(N
2
) moves, with
binary search, we just use O(N lg N) times comparison and O(N
2
) moves.
The Python program regarding to this algorithm is given below.
def isort(xs):
n = len(xs)
for i in range(1, n):
x = xs[i]
p = binary_search(xs[:i], x)
for j in range(i, p, -1):
xs[j] = xs[j-1]
xs[p] = x
def binary_search(xs, x):
l = 0
u = len(xs)
while l < u:
m = (l+u)/2
if xs[m] == x:
return m
elif xs[m] < x:
l = m + 1
else:
u = m
return l
Exercise 2.2
Write the binary search in recursive manner. You neednt use purely func-
tional programming language.
2.4 Improvement 2
Although we improve the search time to O(N lg N) in previous section, the
number of moves is still O(N
2
). The reason of why movement takes so long
time, is because the sequence is stored in plain array. The nature of array
is continuously layout data structure, so the insertion operation is expensive.
54 CHAPTER 2. THE EVOLUTION OF INSERTION SORT
This hints us that we can use linked-list setting to represent the sequence. It
can improve the insertion operation from O(N) to constant time O(1).
insert(A, x) =
_
_
_
{x} : A =
{x} A : x < A
1
{A
1
} insert({A
2
, A
3
, ...A
n
}, x) : otherwise
(2.3)
Translating the algorithm to Haskell yields the below program.
insert :: (Ord a) [a] a [a]
insert [] x = [x]
insert (y:ys) x = if x < y then x:y:ys else y:insert ys x
And we can complete the two versions of insertion sort program based on
the rst two equations in this chapter.
isort [] = []
isort (x:xs) = insert (isort xs) x
Or we can represent the recursion with folding.
isort = foldl insert []
Linked-list setting solution can also be described imperatively. Suppose
function Key(x), returns the value of element stored in node x, and Next(x)
accesses the next node in the linked-list.
function Insert(L, x)
p NIL
H L
while L = NIL Key(L) < Key(x) do
p L
L Next(L)
Next(x) L
if p = NIL then
H x
else
Next(p) x
return H
For example in ANSI C, the linked-list can be dened as the following.
struct node{
Key key;
struct node next;
};
Thus the insert function can be given as below.
struct node insert(struct node lst, struct node x){
struct node p, head;
p = NULL;
for(head = lst; lst && xkey > lstkey; lst = lstnext)
p = lst;
xnext = lst;
if(!p)
return x;
2.5. FINAL IMPROVEMENT BY BINARY SEARCH TREE 55
pnext = x;
return head;
}
Instead of using explicit linked-list such as by pointer or reference based
structure. Linked-list can also be realized by another index array. For any
array element A
i
, Next
i
stores the index of next element follows A
i
. It means
A
Next
i
is the next element after A
i
.
The insertion algorithm based on this solution is given like below.
function Insert(A, Next, i)
j
while Next
j
= NIL A
Next
j
< A
i
do
j Next
j
Next
i
Next
j
Next
j
i
Here means the head of the Next table. And the relative Python program
for this algorithm is given as the following.
def isort(xs):
n = len(xs)
next = [-1](n+1)
for i in range(n):
insert(xs, next, i)
return next
def insert(xs, next, i):
j = -1
while next[j] != -1 and xs[next[j]] < xs[i]:
j = next[j]
next[j], next[i] = i, next[j]
Although we change the insertion operation to constant time by using linked-
list. However, we have to traverse the linked-list to nd the position, which re-
sults O(N
2
) times comparison. This is because linked-list, unlike array, doesnt
support random access. It means we cant use binary search with linked-list
setting.
Exercise 2.3
Complete the insertion sort by using linked-list insertion function in your
favorate imperative programming language.
The index based linked-list return the sequence of rearranged index as
result. Write a program to re-order the original array of elements from
this result.
2.5 Final improvement by binary search tree
It seems that we drive into a corner. We must improve both the comparison
and the insertion at the same time, or we will end up with O(N
2
) performance.
56 CHAPTER 2. THE EVOLUTION OF INSERTION SORT
We must use binary search, this is the only way to improve the comparison
time to O(lg N). On the other hand, we must change the data structure, because
we cant achieve constant time insertion at a position with plain array.
This remind us about our hello world data structure, binary search tree. It
naturally support binary search from its denition. At the same time, We can
insert a new leaf in binary search tree in O(1) constant time if we already nd
the location.
So the algorithm changes to this.
function Sort(A)
T
for each x A do
T Insert-Tree(T, x)
return To-List(T)
Where Insert-Tree() and To-List() are described in previous chapter
about binary search tree.
As we have analyzed for binary search tree, the performance of tree sort is
bound to O(N lg N), which is the lower limit of comparison based sort[3].
2.6 Short summary
In this chapter, we present the evolution process of insertion sort. Insertion sort
is well explained in most textbooks as the rst sorting algorithm. It has simple
and straightforward idea, but the performance is quadratic. Some textbooks
stop here, but we want to show that there exist ways to improve it by dierent
point of view. We rst try to save the comparison time by using binary search,
and then try to save the insertion operation by changing the data structure to
linked-list. Finally, we combine these two ideas and evolute insertion sort to
tree sort.
Bibliography
[1] http://en.wikipedia.org/wiki/Bubble sort
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Cliord
Stein. Introduction to Algorithms, Second Edition. ISBN:0262032937.
The MIT Press. 2001
[3] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting
and Searching (2nd Edition). Addison-Wesley Professional; 2 edition (May
4, 1998) ISBN-10: 0201896850 ISBN-13: 978-0201896855
Red-black tree, not so complex as it was thought
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
57
58 Red black tree
Chapter 3
Red-black tree, not so
complex as it was thought
3.1 Introduction
3.1.1 Exploit the binary search tree
We have shown the power of binary search tree by using it to count the occur-
rence of every word in Bible. The idea is to use binary search tree as a dictionary
for counting.
One may come to the idea that to feed a yellow page book
1
to a binary
search tree, and use it to look up the phone number for a contact.
By modifying a bit of the program for word occurrence counting yields the
following code.
int main(int, char ){
ifstream f("yp.txt");
map<string, string> dict;
string name, phone;
while(f>>name && f>>phone)
dict[name]=phone;
for(;;){
cout<<"nname:";
cin>>name;
if(dict.find(name)==dict.end())
cout<<"notfound";
else
cout<<"phone:"<<dict[name];
}
}
This program works well. However, if you replace the STL map with the bi-
nary search tree as mentioned previously, the performance will be bad, especially
when you search some names such as Zara, Zed, Zulu.
This is because the content of yellow page is typically listed in lexicographic
order. Which means the name list is in increase order. If we try to insert a
1
A name-phone number contact list book
59
60CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
sequence of number 1, 2, 3, ..., n to a binary search tree. We will get a tree like
in Figure 3.1.
1
2
3
...
n
Figure 3.1: unbalanced tree
This is a extreme unbalanced binary search tree. The looking up performs
O(h) for a tree with height h. In balanced case, we benet from binary search
tree by O(lg N) search time. But in this extreme case, the search time down-
graded to O(N). Its no better than a normal link-list.
Exercise 3.1
For a very big yellow page list, one may want to speed up the dictionary
building process by two concurrent tasks (threads or processes). One task
reads the name-phone pair from the head of the list, while the other one
reads from the tail. The building terminates when these two tasks meet
at the middle of the list. What will be the binary search tree looks like
after building? What if you split the the list more than two and use more
tasks?
Can you nd any more cases to exploit a binary search tree? Please
consider the unbalanced trees shown in gure 3.2.
3.1.2 How to ensure the balance of the tree
In order to avoid such case, we can shue the input sequence by randomized
algorithm, such as described in Section 12.4 in [2]. However, this method doesnt
always work, for example the input is fed from user interactively, and the tree
need to be built/updated after each input.
There are many solutions people have ever found to make binary search tree
balanced. Many of them rely on the rotation operations to binary search tree.
Rotation operations change the tree structure while maintain the ordering of
the elements. Thus it either improve or keep the balance property of the binary
search tree.
3.1. INTRODUCTION 61
n
n-1
n-2
...
1
(a)
1
2
n
3
n-1
4
...
(b)
m
m-1 m+1
m-2
...
1
m+2
...
n
(c)
Figure 3.2: Some unbalanced trees
62CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
In this chapter, well rst introduce about red-black tree, which is one of
the most popular and widely used self-adjusting balanced binary search tree. In
next chapter, well introduce about AVL tree, which is another intuitive solution;
In later chapter about binary heaps, well show another interesting tree called
splay tree, which can gradually adjust the the tree to make it more and more
balanced.
3.1.3 Tree rotation
X
a Y
b c
(a)
Y
X c
a b
(b)
Figure 3.3: Tree rotation, rotate-left transforms the tree from left side to right
side, and rotate-right does the inverse transformation.
Tree rotation is a kind of special operation that can transform the tree
structure without changing the in-order traverse result. It based on the fact
that for a specied ordering, there are multiple binary search trees correspond
to it. Figure 3.3 shows the tree rotation. For a binary search tree on the left
side, left rotate transforms it to the tree on the right, and right rotate does the
inverse transformation.
Although tree rotation can be realized in procedural way, there exists quite
simple function description if using pattern matching.
rotateL(T) =
_
node(node(a, X, b), Y, c) : pattern(T) = node(a, X, node(b, Y, c))
T : otherwise
(3.1)
rotateR(T) =
_
node(a, X, node(b, Y, c)) : pattern(T) = node(node(a, X, b), Y, c))
T : otherwise
(3.2)
However, the pseudo code dealing imperatively has to set all elds accord-
ingly.
1: function Left-Rotate(T, x)
2: p Parent(x)
3: y Right(x) Assume y = NIL
4: a Left(x)
5: b Left(y)
6: c Right(y)
3.1. INTRODUCTION 63
7: Replace(x, y)
8: Set-Children(x, a, b)
9: Set-Children(y, x, c)
10: if p = NIL then
11: T y
12: return T
13: function Right-Rotate(T, y)
14: p Parent(y)
15: x Left(y) Assume x = NIL
16: a Left(x)
17: b Right(x)
18: c Right(y)
19: Replace(y, x)
20: Set-Children(y, b, c)
21: Set-Children(x, a, y)
22: if p = NIL then
23: T x
24: return T
25: function Set-Left(x, y)
26: Left(x) y
27: if y = NIL then Parent(y) x
28: function Set-Right(x, y)
29: Right(x) y
30: if y = NIL then Parent(y) x
31: function Set-Children(x, L, R)
32: Set-Left(x, L)
33: Set-Right(x, R)
34: function Replace(x, y)
35: if Parent(x) = NIL then
36: if y = NIL then Parent(y) NIL
37: else if Left(Parent(x)) = x then Set-Left(Parent(x), y)
38: elseSet-Right(Parent(x), y)
39: Parent(x) NIL
Compare these pseudo codes with the pattern matching functions, the former
focus on the structure states changing, while the latter focus on the rotation
process. As the title of this chapter indicated, red-black tree neednt be so
complex as it was thought. Most traditional algorithm text books use the classic
procedural way to teach red-black tree, there are several cases need to deal and
all need carefulness to manipulate the node elds. However, by changing the
mind to functional settings, things get intuitive and simple. Although there is
some performance overhead.
Most of the content in this chapter is based on Chris Okasakis work in [2].
64CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
3.2 Denition of red-black tree
Red-black tree is a type of self-balancing binary search tree[3].
2
By using color
changing and rotation, red-black tree provides a very simple and straightforward
way to keep the tree balanced.
For a binary search tree, we can augment the nodes with a color eld, a node
can be colored either red or black. We call a binary search tree red-black tree
if it satises the following 5 properties[2].
1. Every node is either red or black.
2. The root is black.
3. Every leaf (NIL) is black.
4. If a node is red, then both its children are black.
5. For each node, all paths from the node to descendant leaves contain the
same number of black nodes.
Why this 5 properties can ensure the red-black tree is well balanced? Because
they have a key characteristic, the longest path from root to a leaf cant be as
2 times longer than the shortest path.
Please note the 4-th property, which means there wont be two adjacent red
nodes. so the shortest path only contains black nodes, any paths longer than
the shortest one has interval red nodes. According to property 5, all paths have
the same number of black nodes, this nally ensure there wont be any path is
2 times longer than others[3]. Figure 3.4 shows an example red-black tree.
13
8 17
1 11
6
15 25
22 27
Figure 3.4: An example red-black tree
2
Red-black tree is one of the equivalent form of 2-3-4 tree (see chapter B-tree about 2-3-4
tree). That is to say, for any 2-3-4 tree, there is at least one red-black tree has the same data
order.
3.3. INSERTION 65
All read only operations such as search, min/max are as same as in binary
search tree. While only the insertion and deletion are special.
As we have shown in word occurrence example, many implementation of
set or map container are based on red-black tree. One example is the C++
Standard library (STL)[6].
As mentioned previously, the only change in data layout is that there is color
information augmented to binary search tree. This can be represented as a data
eld in imperative languages such as C++ like below.
enum Color {Red, Black};
template <class T>
struct node{
Color color;
T key;
node left;
node right;
node parent;
};
In functional settings, we can add the color information in constructors,
below is the Haskell example of red-black tree denition.
data Color = R | B
data RBTree a = Empty
| Node Color (RBTree a) a (RBTree a)
Exercise 3.2
Can you prove that a red-black tree with n nodes has height at most
2 lg(n + 1)?
3.3 Insertion
Inserting a new node as what has been mentioned in binary search tree may
cause the tree unbalanced. The red-black properties has to be maintained, so
we need do some xing by transform the tree after insertion.
When we insert a new key, one good practice is to always insert it as a red
node. As far as the new inserted node isnt the root of the tree, we can keep all
properties except the 4-th one. that it may bring two adjacent red nodes.
Functional and procedural implementation have dierent xing methods.
One is intuitive but has some overhead, the other is a bit complex but has
higher performance. Most text books about algorithm introduce the later one.
In this chapter, we focus on the former to show how easily a red-black tree
insertion algorithm can be realized. The traditional procedural method will be
given only for comparison purpose.
As described by Chris Okasaki, there are total 4 cases which violate property
4. All of them has 2 adjacent red nodes. However, they have a uniformed form
after xing[2] as shown in gure 4.3.
Note that this transformation will move the redness one level up. So this is a
bottom-up recursive xing, the last step will make the root node red. According
66CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
@
@
@R
@
@
@I
z
y D
x C
A B
z
x D
A y
B C
x
A y
B z
C D
x
A z
y D
B C
y
x z
A B C D
Figure 3.5: 4 cases for balancing a red-black tree after insertion
3.3. INSERTION 67
to property 2, root is always black. Thus we need nal xing to revert the root
color to black.
Observing that the 4 cases and the xed result have strong pattern features,
the xing function can be dened by using the similar method we mentioned in
tree rotation. To avoid too long formula, we abbreviate Color as C, Black as
B, and Red as R.
balance(T) =
_
node(R, node(B, A, x, B), y, node(B, C, z, D)) : match(T)
T : otherwise
(3.3)
where function node() can construct a red-black tree node with 4 parameters
as color, the left child, the key and the right child. Function match() can test
if a tree is one of the 4 possible patterns as the following.
match(T) =
pattern(T) = node(B, node(R, node(R, A, x, B), y, C), z, D)
pattern(T) = node(B, node(R, A, x, node(R, B, y, C), z, D))
pattern(T) = node(B, A, x, node(R, B, y, node(R, C, z, D)))
pattern(T) = node(B, A, x, node(R, node(R, B, y, C), z, D))
With function balance() dened, we can modify the previous binary search
tree insertion functions to make it work for red-black tree.
insert(T, k) = makeBlack(ins(T, k)) (3.4)
where
ins(T, k) =
_
_
_
node(R, , k, ) : T =
balance(node(ins(L, k), Key, R)) : k < Key
balance(node(L, Key, ins(R, k))) : otherwise
(3.5)
L, R, Key represent the left child, right child and the key of a tree.
L = left(T)
R = right(T)
Key = key(T)
Function makeBlack() is dened as the following, it forces the color of a
non-empty tree to be black.
makeBlack(T) = node(B, L, Key, R) (3.6)
Summarize the above functions and use language supported pattern match-
ing features, we can come to the following Haskell program.
insert::(Ord a)RBTree a a RBTree a
insert t x = makeBlack $ ins t where
ins Empty = Node R Empty x Empty
ins (Node color l k r)
| x < k = balance color (ins l) k r
| otherwise = balance color l k (ins r) --[3]
makeBlack(Node _ l k r) = Node B l k r
68CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
balance::Color RBTree a a RBTree a RBTree a
balance B (Node R (Node R a x b) y c) z d =
Node R (Node B a x b) y (Node B c z d)
balance B (Node R a x (Node R b y c)) z d =
Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R b y (Node R c z d)) =
Node R (Node B a x b) y (Node B c z d)
balance B a x (Node R (Node R b y c) z d) =
Node R (Node B a x b) y (Node B c z d)
balance color l k r = Node color l k r
Note that the balance function is changed a bit from the original denition.
Instead of passing the tree, we pass the color, the left child, the key and the
right child to it. This can save a pair of boxing and un-boxing operations.
This program doesnt handle the case of inserting a duplicated key. However,
it is possible to handle it either by overwriting, or skipping. Another option is
to augment the data with a linked list[2].
Figure 3.6 shows two red-black trees built from feeding list 11, 2, 14, 1, 7,
15, 5, 8, 4 and 1, 2, ..., 8.
14
7
2
1 5 11 15
4 8
1
2
3
4
6
5 7
8
Figure 3.6: insert results generated by the Haskell algorithm
This algorithm shows great simplicity by summarizing the uniform feature
from the four dierent unbalanced cases. It is expressive over the traditional tree
rotation approach, that even in programming languages which dont support
pattern matching, the algorithm can still be implemented by manually check
the pattern. A Scheme/Lisp program is available along with this book can be
referenced as an example.
The insertion algorithm takes O(lg N) time to insert a key to a red-black
tree which has N nodes.
Exercise 3.3
Write a program in an imperative language, such as C, C++ or python
to realize the same algorithm in this section. Note that, because there is
no language supported pattern matching, you need to test the 4 dierent
cases manually.
3.4. DELETION 69
3.4 Deletion
Remind the deletion section in binary search tree. Deletion is imperative only
for red-black tree as well. In typically practice, it often builds the tree just one
time, and performs looking up frequently after that. Okasaki explained why
he didnt provide red-black tree deletion in his work in [3]. One reason is that
deletions are much messier than insertions.
The purpose of this section is just to show that red-black tree deletion is
possible in purely functional settings, although it actually rebuilds the tree
because trees are read only in terms of purely functional data structure. In real
world, its up to the user (or actually the programmer) to adopt the proper
solution. One option is to mark the node be deleted with a ag, and perform
a tree rebuilding when the number of deleted nodes exceeds 50% of the total
number of nodes.
Not only in functional settings, even in imperative settings, deletion is more
complex than insertion. We face more cases to x. Deletion may also violate the
red black tree properties, so we need x it after the normal deletion as described
in binary search tree.
The deletion algorithm in this book are based on top of a handout in [5].
The problem only happens if you try to delete a black node, because it will
violate the 4-th property of red-black tree, which means the number of black
node in the path may decreased so that it is not uniformed black-height any
more.
When delete a black node, we can resume red-black property number 4 by
introducing a doubly-black concept[2]. It means that the although the node
is deleted, the blackness is kept by storing it in the parent node. If the parent
node is red, it turns to black, However, if it has been already black, it turns to
doubly-black.
In order to express the doubly-black node, The denition need some mod-
ication accordingly.
data Color = R | B | BB -- BB: doubly black for deletion
data RBTree a = Empty | BBEmpty -- doubly black empty
| Node Color (RBTree a) a (RBTree a)
When deleting a node, we rst perform the same deleting algorithm in bi-
nary search tree mentioned in previous chapter. After that, if the node to be
sliced out is black, we need x the tree to keep the red-black properties. Lets
denote the empty tree as , and for non-empty tree, it can be decomposed to
node(Color, L, Key, R) for its color, left sub-tree, key and the right sub-tree.
The delete function is dened as the following.
delete(T, k) = blackenRoot(del(T, k)) (3.7)
70CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
where
del(T, k) =
_
_
: T =
fixBlack
2
(node(C, del(L, k), Key, R)) : k < Key
fixBlack
2
(node(C, L, Key, del(R, k))) : k > Key
=
_
mkBlk(R) : C = B
R : otherwise
: L =
=
_
mkBlk(L) : C = B
L : otherwise
: R =
fixBlack
2
(node(C, L, k
, del(R, k
))) : otherwise
(3.8)
The real deleting happens inside function del. For the trivial case, that the
tree is empty, the deletion result is ; If the key to be deleted is less than the
key of the current node, we recursively perform deletion on its left sub-tree; if
it is bigger than the key of the current node, then we recursively delete the key
from the right sub-tree; Because it may bring doubly-blackness, so we need x
it.
If the key to be deleted is equal to the key of the current node, we need
splice it out. If one of its children is empty, we just replace the node by the
other one and reserve the blackness of this node. otherwise we cut and past the
minimum element k
_
: T =
node(B, L, Key, R) : C = R
node(B
2
, L, Key, R) : C = B
T : otherwise
(3.10)
where means doubly-black empty node and B
2
is the doubly-black color.
Summarizing the above functions yields the following Haskell program.
delete::(Ord a)RBTree a a RBTree a
delete t x = blackenRoot(del t x) where
del Empty _ = Empty
del (Node color l k r) x
| x < k = fixDB color (del l x) k r
| x > k = fixDB color l k (del r x)
-- x == k, delete this node
| isEmpty l = if color==B then makeBlack r else r
3.4. DELETION 71
| isEmpty r = if color==B then makeBlack l else l
| otherwise = fixDB color l k (del r k) where k= min r
blackenRoot (Node _ l k r) = Node B l k r
blackenRoot _ = Empty
makeBlack::RBTree a RBTree a
makeBlack (Node B l k r) = Node BB l k r -- doubly black
makeBlack (Node _ l k r) = Node B l k r
makeBlack Empty = BBEmpty
makeBlack t = t
The nal attack to the red-black tree deletion algorithm is to realize the
fixBlack
2
function. The purpose of this function is to eliminate the doubly-
black colored node by rotation and color changing.
Lets solve the doubly-black empty node rst. For any node, If one of its
child is doubly-black empty, and the other child is non-empty, we can safely
replace the doubly-black empty with a normal empty node.
Like gure 3.7, if we are going to delete the node 4 from the tree (Instead
show the whole tree, only part of the tree is shown), the program will use a
doubly-black empty node to replace node 4. In the gure, the doubly-black
node is shown in black circle with 2 edges. It means that for node 5, it has a
doubly-black empty left child and has a right non-empty child (a leaf node with
key 6). In such case we can safely change the doubly-black empty to normal
empty node. which wont violate any red-black properties.
3
2 5
1 4 6
(a) Delete 4 from the tree.
3
2 5
1 NIL 6
(b) After 4 is sliced o, it is doubly-black empty.
3
2 5
1 6
(c) We can safely change it to normal NIL.
Figure 3.7: One child is doubly-black empty node, the other child is non-empty
On the other hand, if a node has a doubly-black empty node and the other
child is empty, we have to push the doubly-blackness up one level. For example
72CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
in gure 3.8, if we want to delete node 1 from the tree, the program will use a
doubly-black empty node to replace 1. Then node 2 has a doubly-black empty
node and has an empty right node. In such case we must mark node 2 as
doubly-black after change its left child back to empty.
3
2 5
1 4 6
(a) Delete 1 from the tree.
3
2 5
NIL 4 6
(b) After 1 is sliced o, it is doubly-black empty.
3
2 5
4 6
(c) We must push the doubly-blackness up
to node 2.
Figure 3.8: One child is doubly-black empty node, the other child is empty.
Based on above analysis, in order to x the doubly-black empty node, we
dene the function partially like the following.
fixBlack
2
(T) =
_
_
node(B
2
, , Key, ) : (L = R = ) (L = R = )
node(C, , Key, R) : L = R =
node(C, L, Key, ) : R = L =
... : ...
(3.11)
After dealing with doubly-black empty node, we need to x the case that
the sibling of the doubly-black node is black and it has one red child. In this
situation, we can x the doubly-blackness with one rotation. Actually there are
4 dierent sub-cases, all of them can be transformed to one uniformed pattern.
They are shown in the gure 3.9. These cases are described in [2] as case 3 and
case 4.
3.4. DELETION 73
Figure 3.9: Fix the doubly black by rotation, the sibling of the doubly-black
node is black, and it has one red child.
74CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
The handling of these 4 sub-cases can be dened on top of formula 3.11.
fixBlack
2
(T) =
_
_
... : ...
node(C, node(B, mkBlk(A), x, B), y, node(B, C, z, D)) : p1.1
node(C, node(B, A, x, B), y, node(B, C, z, mkBlk(D))) : p1.2
... : ...
(3.12)
where p1.1 and p1.2 each represent 2 patterns as the following.
p1.1 =
_
_
_
node(C, A, x, node(B, node(R, B, y, C), z, D)) Color(A) = B
2
_
... : ...
mkBlk(node(C, mkBlk(A), x, node(R, B, y, C))) : p2.1
mkBlk(node(C, node(R, A, x, B), y, mkBlk(C))) : p2.2
... : ...
(3.13)
where p2.1 and p2.2 are two patterns as below.
p2.1 =
_
node(C, A, x, node(B, B, y, C))
Color(A) = B
2
Color(B) = Color(C) = B
_
p2.2 =
_
node(C, node(B, A, x, B), y, C)
Color(C) = B
2
Color(A) = Color(B) = B
_
There is a nal case left, that the sibling of the doubly-black node is red.
We can do a rotation to change this case to pattern p1.1 or p1.2. Figure 3.11
shows about it.
We can nish formula 3.13 with 3.14.
fixBlack
2
(T) =
_
_
... : ...
fixBlack
2
(B, fixBlack
2
(node(R, A, x, B), y, C) : p3.1
fixBlack
2
(B, A, x, fixBlack
2
(node(R, B, y, C)) : p3.2
T : otherwise
(3.14)
3.4. DELETION 75
=
x
a y
b c
(a) Color of x can be either black or red.
x
a y
b c
(b) If x was red, then it becomes black, oth-
erwise, it becomes doubly-black.
=
y
x c
a b
(c) Color of y can be either black or red.
y
x c
a b
(d) If y was red, then it becomes black, oth-
erwise, it becomes doubly-black.
Figure 3.10: propagate the blackness up.
Figure 3.11: The sibling of the doubly-black node is red.
76CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
where p3.1 and p3.2 are two patterns as the following.
p3.1 = {Color(T) = B Color(L) = B
2
Color(R) = R}
p3.2 = {Color(T) = B Color(L) = R Color(R) = B
2
}
This two cases are described in [2] as case 1.
Fixing the doubly-black node with all above dierent cases is a recursive
function. There are two termination conditions. One contains pattern p1.1 and
p1.2, The doubly-black node was eliminated. The other cases may continuously
propagate the doubly-blackness from bottom to top till the root. Finally the
algorithm marks the root node as black anyway. The doubly-blackness will be
removed.
Put formula 3.11, 3.12, 3.13, and 3.14 together, we can write the nal Haskell
program.
fixDB::Color RBTree a a RBTree a RBTree a
fixDB color BBEmpty k Empty = Node BB Empty k Empty
fixDB color BBEmpty k r = Node color Empty k r
fixDB color Empty k BBEmpty = Node BB Empty k Empty
fixDB color l k BBEmpty = Node color l k Empty
-- the sibling is black, and it has one red child
fixDB color a@(Node BB _ _ _) x (Node B (Node R b y c) z d) =
Node color (Node B (makeBlack a) x b) y (Node B c z d)
fixDB color a@(Node BB _ _ _) x (Node B b y (Node R c z d)) =
Node color (Node B (makeBlack a) x b) y (Node B c z d)
fixDB color (Node B a x (Node R b y c)) z d@(Node BB _ _ _) =
Node color (Node B a x b) y (Node B c z (makeBlack d))
fixDB color (Node B (Node R a x b) y c) z d@(Node BB _ _ _) =
Node color (Node B a x b) y (Node B c z (makeBlack d))
-- the sibling and its 2 children are all black, propagate the blackness up
fixDB color a@(Node BB _ _ _) x (Node B b@(Node B _ _ _) y c@(Node B _ _ _))
= makeBlack (Node color (makeBlack a) x (Node R b y c))
fixDB color (Node B a@(Node B _ _ _) x b@(Node B _ _ _)) y c@(Node BB _ _ _)
= makeBlack (Node color (Node R a x b) y (makeBlack c))
-- the sibling is red
fixDB B a@(Node BB _ _ _) x (Node R b y c) = fixDB B (fixDB R a x b) y c
fixDB B (Node R a x b) y c@(Node BB _ _ _) = fixDB B a x (fixDB R b y c)
-- otherwise
fixDB color l k r = Node color l k r
The deletion algorithm takes O(lg N) time to delete a key from a red-black
tree with N nodes.
Exercise 3.4
As we mentioned in this section, deletion can be implemented by just
marking the node as deleted without actually removing it. Once the num-
ber of marked nodes exceeds 50% of the total node number, a tree re-build
is performed. Try to implement this method in your favorite programming
language.
3.5. IMPERATIVE RED-BLACK TREE ALGORITHM 77
3.5 Imperative red-black tree algorithm
We almost nished all the content in this chapter. By induction the patterns, we
can implement the red-black tree in a simple way compare to the imperative tree
rotation solution. However, we should show the comparator for completeness.
For insertion, the basic idea is to use the similar algorithm as described in
binary search tree. And then x the balance problem by rotation and return
the nal result.
1: function Insert(T, k)
2: root T
3: x Create-Leaf(k)
4: Color(x) RED
5: parent NIL
6: while T = NIL do
7: parent T
8: if k < Key(T) then
9: T Left(T)
10: else
11: T Right(T)
12: Parent(x) parent
13: if parent = NIL then tree T is empty
14: return x
15: else if k < Key(parent) then
16: Left(parent) x
17: else
18: Right(parent) x
19: return Insert-Fix(root, x)
The only dierence from the binary search tree insertion algorithm is that
we set the color of the new node as red, and perform xing before return. It is
easy to translate the pseudo code to real imperative programming language, for
instance Python
3
.
def rb_insert(t, key):
root = t
x = Node(key)
parent = None
while(t):
parent = t
if(key < t.key):
t = t.left
else:
t = t.right
if parent is None: #tree is empty
root = x
elif key < parent.key:
parent.set_left(x)
else:
parent.set_right(x)
return rb_insert_fix(root, x)
3
C, and C++ source codes are available along with this book
78CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
There are 3 base cases for xing, and if we take the left-right symmetric
into consideration. there are total 6 cases. Among them two cases can be
merged together, because they all have uncle node in red color, we can toggle
the parent color and uncle color to black and set grand parent color to red.
With this merging, the xing algorithm can be realized as the following.
1: function Insert-Fix(T, x)
2: while Parent(x) = NIL and Color(Parent(x)) = RED do
3: if Color(Uncle(x)) = RED then Case 1, xs uncle is red
4: Color(Parent(x)) BLACK
5: Color(Grand-Parent(x)) RED
6: Color(Uncle(x)) BLACK
7: x Grandparent(x)
8: else xs uncle is black
9: if Parent(x) = Left(Grand-Parent(x)) then
10: if x = Right(Parent(x)) then Case 2, x is a right child
11: x Parent(x)
12: T Left-Rotate(T, x)
Case 3, x is a left child
13: Color(Parent(x)) BLACK
14: Color(Grand-Parent(x)) RED
15: T Right-Rotate(T, Grand-Parent(x))
16: else
17: if x = Left(Parent(x)) then Case 2, Symmetric
18: x Parent(x)
19: T Right-Rotate(T, x)
Case 3, Symmetric
20: Color(Parent(x)) BLACK
21: Color(Grand-Parent(x)) RED
22: T Left-Rotate(T, Grand-Parent(x))
23: Color(T) BLACK
24: return T
This program takes O(lg N) time to insert a new key to the red-black tree.
Compare this pseudo code and the balance function we dened in previous
section, we can see the dierence. They dier not only in terms of simplicity,
but also in logic. Even if we feed the same series of keys to the two algorithms,
they may build dierent red-black trees. There is a bit performance overhead
in the pattern matching algorithm. Okasaki discussed about the dierence in
detail in his paper[2].
Translate the above algorithm to Python yields the below program.
# Fix the redred violation
def rb_insert_fix(t, x):
while(x.parent and x.parent.color==RED):
if x.uncle().color == RED:
#case 1: ((a:R x:R b) y:B c:R) = ((a:R x:B b) y:R c:B)
set_color([x.parent, x.grandparent(), x.uncle()],
[BLACK, RED, BLACK])
x = x.grandparent()
else:
if x.parent == x.grandparent().left:
if x == x.parent.right:
3.6. MORE WORDS 79
#case 2: ((a x:R b:R) y:B c) = case 3
x = x.parent
t=left_rotate(t, x)
# case 3: ((a:R x:R b) y:B c) = (a:R x:B (b y:R c))
set_color([x.parent, x.grandparent()], [BLACK, RED])
t=right_rotate(t, x.grandparent())
else:
if x == x.parent.left:
#case 2: (a x:B (b:R y:R c)) = case 3
x = x.parent
t = right_rotate(t, x)
# case 3: (a x:B (b y:R c:R)) = ((a x:R b) y:B c:R)
set_color([x.parent, x.grandparent()], [BLACK, RED])
t=left_rotate(t, x.grandparent())
t.color = BLACK
return t
Figure 3.12 shows the results of feeding same series of keys to the above
python insertion program. Compare them with gure 3.6, one can tell the
dierence clearly.
11
2 14
1 7
5 8
15
(a)
5
2 7
1 4
3
6 9
8
(b)
Figure 3.12: Red-black trees created by imperative algorithm.
We skip the red-black tree delete algorithm in imperative settings, because
it is even more complex than the insertion. The implementation of deleting is
left as an exercise of this chapter.
Exercise 3.5
Implement the red-black tree deleting algorithm in your favorite impera-
tive programming language. you can refer to [2] for algorithm details.
3.6 More words
Red-black tree is the most popular implementation of balanced binary search
tree. Another one is the AVL tree, which well introduce in next chapter. Red-
black tree can be a good start point for more data structures. If we extend the
80CHAPTER 3. RED-BLACK TREE, NOT SO COMPLEX AS IT WAS THOUGHT
number of children from 2 to K, and keep the balance as well, it leads to B-
tree, If we store the data along with edge but not inside node, it leads to Tries.
However, the multiple cases handling and the long program tends to make new
comers think red-black tree is complex.
Okasakis work helps making the red-black tree much easily understand.
There are many implementation in other programming languages in that manner
[7]. Its also inspired me to nd the pattern matching solution for Splay tree
and AVL tree etc.
Bibliography
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Cliord
Stein. Introduction to Algorithms, Second Edition. ISBN:0262032937.
The MIT Press. 2001
[2] Chris Okasaki. FUNCTIONAL PEARLS Red-Black Trees in a Functional
Setting. J. Functional Programming. 1998
[3] Chris Okasaki. Ten Years of Purely Functional Data Structures.
http://okasaki.blogspot.com/2008/02/ten-years-of-purely-functional-
data.html
[4] Wikipedia. Red-black tree. http://en.wikipedia.org/wiki/Red-black tree
[5] Lyn Turbak. Red-Black Trees. cs.wellesley.edu/ cs231/fall01/red-
black.pdf Nov. 2, 2001.
[6] SGI STL. http://www.sgi.com/tech/stl/
[7] Pattern matching. http://rosettacode.org/wiki/Pattern matching
AVL tree
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
81
82 AVL tree
Chapter 4
AVL tree
4.1 Introduction
4.1.1 How to measure the balance of a tree?
Besides red-black tree, are there any other intuitive solutions of self-balancing
binary search tree? In order to measure how balancing a binary search tree,
one idea is to compare the height of the left sub-tree and right sub-tree. If
they diers a lot, the tree isnt well balanced. Lets denote the dierence height
between two children as below
(T) = |L| |R| (4.1)
Where |T| means the height of tree T, and L, R denotes the left sub-tree
and right sub-tree.
If (T) = 0, The tree is denitely balanced. For example, a complete binary
tree has N = 2
h
1 nodes for height h. There is no empty branches unless the
leafs. Another trivial case is empty tree. () = 0. The less absolute value of
(T) the more balancing the tree is.
We dene (T) as the balance factor of a binary search tree.
4.2 Denition of AVL tree
An AVL tree is a special binary search tree, that all sub-trees satisfying the
following criteria.
|(T)| 1 (4.2)
The absolute value of balance factor is less than or equal to 1, which means
there are only three valid values, -1, 0 and 1. Figure 4.1 shows an example AVL
tree.
Why AVL tree can keep the tree balanced? In other words, Can this de-
nition ensure the height of the tree as O(lg N) where N is the number of the
nodes in the tree? Lets prove this fact.
For an AVL tree of height h, The number of nodes varies. It can have at
most 2
h
1 nodes for a complete binary tree. We are interesting about how
83
84 CHAPTER 4. AVL TREE
4
2 8
1 3 6 9
5 7 10
Figure 4.1: An example AVL tree
many nodes there are at least. Lets denote the minimum number of nodes for
height h AVL tree as N(h). Its obvious for the trivial cases as below.
For empty tree, h = 0, N(0) = 0;
For a singleton root, h = 1, N(1) = 1;
Whats the situation for common case N(h)? Figure 4.2 shows an AVL tree
T of height h. It contains three part, the root node, and two sub trees A, B.
We have the following fact.
h = max(height(L), height(R)) + 1 (4.3)
We immediately know that, there must be one child has height h 1. Lets
say height(A) = h 1. According to the denition of AVL tree, we have.
|height(A) height(B)| 1. This leads to the fact that the height of other
tree B cant be lower than h 2, So the total number of the nodes of T is the
number of nodes in tree A, and B plus 1 (for the root node). We exclaim that.
N(h) = N(h 1) +N(h 2) + 1 (4.4)
k
h-1 h-2
Figure 4.2: An AVL tree with height h, one of the sub-tree with height h 1,
the other is h 2
4.2. DEFINITION OF AVL TREE 85
This recursion reminds us the famous Fibonacci series. Actually we can
transform it to Fibonacci series by dening N
(h) = N
(h 1) +N
(h 2) (4.5)
Lemma 4.2.1. Let N(h) be the minimum number of nodes for an AVL tree
with height h. and N
(h)
h
(4.6)
Where =
5+1
2
is the golden ratio.
Proof. For the trivial case, we have
h = 0, N
(0) = 1
0
= 1
h = 1, N
(1) = 2
1
= 1.618...
For the induction case, suppose N
(h)
h
.
N
(h + 1) = N
(h) +N
(h 1) {Fibonacci}
h
+
h1
=
h1
( + 1) { + 1 =
2
=
5+3
2
}
=
h+1
From Lemma 4.2.1, we immediately get
h log
(N + 1) = log
, H). Where T
, H
l
), we need do balancing adjustment as well as updating the increment
of height. Function tree() is dened to dealing with this task. It takes 4 pa-
rameters as (L
, H
l
), Key, (R
, H
r
), and . The result of this function is
dened as (T
, H), where T
| |T| (4.10)
This can be further detailed deduced in 4 cases.
H = |T
| |T|
= 1 +max(|R
|, |L
|) (1 +max(|R|, |L|))
= max(|R
|, |L
|) max(|R|, |L|)
=
_
_
H
r
: 0
0
+ H
r
: 0
0
H
l
: 0
0
H
l
: otherwise
(4.11)
To prove this equation, note the fact that the height cant increase both in
left and right with only one insertion.
These 4 cases can be explained from the denition of balance factor denition
that it equal to the dierence from the right sub tree and left sub tree.
If 0 and
|, |L
|) max(|R|, |L|) { 0
0}
= |R
| |L
| {|L| = |L
|}
= |R| + H
r
|L|
= + H
r
For the case 0
|, |L
|) max(|R|, |L|) { 0
0}
= |L
| |R|
= |L| + H
l
|R|
= H
l
= |R
| |L
|
= |R| + H
r
(|L| + H
l
)
= |R| |L| + H
r
H
l
= + H
r
H
l
(4.12)
With all these changes in height and balancing factor get clear, its possible
to dene the tree() function mentioned in (4.9).
tree((L
, H
l
), Key, (R
, H
r
), ) = balance(node(L
, Key, R
), H)
(4.13)
Before we moving into details of balancing adjustment, lets translate the
above equations to real programs in Haskell.
First is the insert function.
insert::(Ord a)AVLTree a a AVLTree a
insert t x = fst $ ins t where
ins Empty = (Br Empty x Empty 0, 1)
ins (Br l k r d)
| x < k = tree (ins l) k (r, 0) d
| x == k = (Br l k r d, 0)
| otherwise = tree (l, 0) k (ins r) d
Here we also handle the case that inserting a duplicated key (which means
the key has already existed.) as just overwriting.
tree::(AVLTree a, Int) a (AVLTree a, Int) Int (AVLTree a, Int)
tree (l, dl) k (r, dr) d = balance (Br l k r d, delta) where
d = d + dr - dl
delta = deltaH d d dl dr
And the denition of height increment is as below.
deltaH :: Int Int Int Int Int
deltaH d d dl dr
| d 0 && d 0 = dr
| d 0 && d 0 = d+dr
| d 0 && d 0 = dl - d
| otherwise = dl
4.3.1 Balancing adjustment
As the pattern matching approach is adopted in doing re-balancing. We need
consider what kind of patterns violate the AVL tree property.
Figure 4.3 shows the 4 cases which need x. For all these 4 cases the bal-
ancing factors are either -2, or +2 which exceed the range of [1, 1]. After
balancing adjustment, this factor turns to be 0, which means the height of left
sub tree is equal to the right sub tree.
We call these four cases left-left lean, right-right lean, right-left lean, and left-
right lean cases in clock-wise direction from top-left. We denote the balancing
4.3. INSERTION 89
@
@
@R
@
@
@I
(z) = 2
(y) = 1
(x) = 2
(y) = 1
(z) = 2
(x) = 1
(x) = 2
(z) = 1
(y) = 0
z
y D
x C
A B
z
x D
A y
B C
x
A y
B z
C D
x
A z
y D
B C
y
x z
A B C D
Figure 4.3: 4 cases for balancing a AVL tree after insertion
90 CHAPTER 4. AVL TREE
factor before xing as (x), (y), and (z), while after xing, they changes to
(x),
(y), and
(z) respectively.
Well next prove that, after xing, we have (y) = 0 for all four cases, and
well provide the result values of
(x) and
(z).
Left-left lean case
As the structure of sub tree x doesnt change due to xing, we immediately get
(x) = (x).
Since (y) = 1 and (z) = 2, we have
(y) = |C| |x| = 1 |C| = |x| 1
(z) = |D| |y| = 2 |D| = |y| 2
(4.14)
After xing.
(x) = (x)
(y) = 0
(z) = 0
(4.17)
Right-right lean case
Since right-right case is symmetric to left-left case, we can easily achieve the
result balancing factors as
(x) = 0
(y) = 0
(z) = (z)
(4.18)
Right-left lean case
First lets consider
(x) = 1
(4.23)
If (y) = 1, it means max(|B|, |C|) = |B|, taking this into (4.21), yields.
|B| |A| = 0 {By (4.19)}
(x) = 0
(4.24)
Summarize these 2 cases, we get relationship of
(x) =
_
1 : (y) = 1
0 : otherwise
(4.25)
For
(z) = 1.
If (y) = 1, then max(|B|, |C|) = |C|, we get
(z) = 0.
Combined these two cases, the relationship between
(z) =
_
1 : (y) = 1
0 : otherwise
(4.27)
Finally, for
(y) = 0.
If (y) = 1, From (4.27), we have
(y) = 0.
Collect all the above results, we get the new balancing factors after xing as
the following.
(x) =
_
1 : (y) = 1
0 : otherwise
(y) = 0
(z) =
_
1 : (y) = 1
0 : otherwise
(4.29)
Left-right lean case
Left-right lean case is symmetric to the Right-left lean case. By using the similar
deduction, we can nd the new balancing factors are identical to the result in
(4.29).
4.3.2 Pattern Matching
All the problems have been solved and its time to dene the nal pattern
matching xing function.
balance(T, H) =
_
_
(node(node(A, x, B, (x)), y, node(C, z, D, 0), 0), 0) : P
ll
(T)
(node(node(A, x, B, 0), y, node(C, z, D, (z)), 0), 0) : P
rr
(T)
(node(node(A, x, B,
(x)), y, node(C, z, D,
(z)), 0), 0) : P
rl
(T) P
lr
(T)
(T, H) : otherwise
(4.30)
Where P
ll
(T) means the pattern of tree T is left-left lean respectively.
(x)
and delta
(z) are dened in (4.29). The four patterns are tested as below.
P
ll
(T) = node(node(node(A, x, B, (x)), y, C, 1), z, D, 2)
P
rr
(T) = node(A, x, node(B, y, node(C, z, D, (z)), 1), 2)
P
rl
(T) = node(node(A, x, node(B, y, C, (y)), 1), z, D, 2)
P
lr
(T) = node(A, x, node(node(B, y, C, (y)), z, D, 1), 2)
(4.31)
4.3. INSERTION 93
Translating the above function denition to Haskell yields a simple and in-
tuitive program.
balance :: (AVLTree a, Int) (AVLTree a, Int)
balance (Br (Br (Br a x b dx) y c (-1)) z d (-2), _) =
(Br (Br a x b dx) y (Br c z d 0) 0, 0)
balance (Br a x (Br b y (Br c z d dz) 1) 2, _) =
(Br (Br a x b 0) y (Br c z d dz) 0, 0)
balance (Br (Br a x (Br b y c dy) 1) z d (-2), _) =
(Br (Br a x b dx) y (Br c z d dz) 0, 0) where
dx = if dy == 1 then -1 else 0
dz = if dy == -1 then 1 else 0
balance (Br a x (Br (Br b y c dy) z d (-1)) 2, _) =
(Br (Br a x b dx) y (Br c z d dz) 0, 0) where
dx = if dy == 1 then -1 else 0
dz = if dy == -1 then 1 else 0
balance (t, d) = (t, d)
The insertion algorithm takes time proportion to the height of the tree, and
according to the result we proved above, its performance is O(lg N) where N is
the number of elements stored in the AVL tree.
Verication
One can easily create a function to verify a tree is AVL tree. Actually we need
verify two things, rst, its a binary search tree; second, it satises AVL tree
property.
We left the rst verication problem as an exercise to the reader.
In order to test if a binary tree satises AVL tree property, we can test
the dierence in height between its two children, and recursively test that both
children conform to AVL property until we arrive at an empty leaf.
avl?(T) =
_
True : T =
avl?(L) avl?(R) ||R| |L|| 1 : otherwise
(4.32)
And the height of a AVL tree can also be calculate from the denition.
|T| =
_
0 : T =
1 +max(|R|, |L|) : otherwise
(4.33)
The corresponding Haskell program is given as the following.
isAVL :: (AVLTree a) Bool
isAVL Empty = True
isAVL (Br l _ r d) = and [isAVL l, isAVL r, abs (height r - height l) 1]
height :: (AVLTree a) Int
height Empty = 0
height (Br l _ r _) = 1 + max (height l) (height r)
Exercise 4.1
Write a program to verify a binary tree is a binary search tree in your
favorite programming language. If you choose to use an imperative language,
please consider realize this program without recursion.
94 CHAPTER 4. AVL TREE
4.4 Deletion
As we mentioned before, deletion doesnt make signicant sense in purely func-
tional settings. As the tree is read only, its typically performs frequently looking
up after build.
Even if we implement deletion, its actually re-building the tree as we pre-
sented in chapter of red-black tree. We left the deletion of AVL tree as an
exercise to the reader.
Exercise 4.2
Take red-black tree deletion algorithm as an example, write the AVL tree
deletion program in purely functional approach in your favorite program-
ming language.
Write the deletion algorithm in imperative approach in your favorite pro-
gramming language.
4.5 Imperative AVL tree algorithm
We almost nished all the content in this chapter about AVL tree. However, it
necessary to show the traditional insert-and-rotate approach as the comparator
to pattern matching algorithm.
Similar as the imperative red-black tree algorithm, the strategy is rst to do
the insertion as same as for binary search tree, then x the balance problem by
rotation and return the nal result.
1: function Insert(T, k)
2: root T
3: x Create-Leaf(k)
4: (x) 0
5: parent NIL
6: while T = NIL do
7: parent T
8: if k < Key(T) then
9: T Left(T)
10: else
11: T Right(T)
12: Parent(x) parent
13: if parent = NIL then tree T is empty
14: return x
15: else if k < Key(parent) then
16: Left(parent) x
17: else
18: Right(parent) x
19: return AVL-Insert-Fix(root, x)
Note that after insertion, the height of the tree may increase, so that the
balancing factor may also change, insert on right side will increase by 1,
4.5. IMPERATIVE AVL TREE ALGORITHM 95
while insert on left side will decrease it. By the end of this algorithm, we need
perform bottom-up xing from node x towards root.
We can translate the pseudo code to real programming language, such as
Python
2
.
def avl_insert(t, key):
root = t
x = Node(key)
parent = None
while(t):
parent = t
if(key < t.key):
t = t.left
else:
t = t.right
if parent is None: #tree is empty
root = x
elif key < parent.key:
parent.set_left(x)
else:
parent.set_right(x)
return avl_insert_fix(root, x)
This is a top-down algorithm search the tree from root down to the proper
position and insert the new key as a leaf. By the end of this algorithm, it calls
xing procedure, by passing the root and the new node inserted.
Note that we reuse the same methods of set left() and set right() as we
dened in chapter of red-black tree.
In order to resume the AVL tree balance property by xing, we rst deter-
mine if the new node is inserted on left hand or right hand. If it is on left, the
balancing factor decreases, otherwise it increases. If we denote the new value
as
.
If || = 1 and |
1
6: else
7:
+ 1
8: (Parent(x))
9: P Parent(x)
2
C and C++ source code are available along with this book
96 CHAPTER 4. AVL TREE
10: L Left(x)
11: R Right(x)
12: if || = 1 and |
| = 2 then
17: if
= 2 then
18: if (R) = 1 then Right-right case
19: (P) 0 By (4.18)
20: (R) 0
21: T Left-Rotate(T, P)
22: if (R) = 1 then Right-left case
23:
y
(Left(R)) By (4.29)
24: if
y
= 1 then
25: (P) 1
26: else
27: (P) 0
28: (Left(R)) 0
29: if
y
= 1 then
30: (R) 1
31: else
32: (R) 0
33: T Right-Rotate(T, R)
34: T Left-Rotate(T, P)
35: if
= 2 then
36: if (L) = 1 then Left-left case
37: (P) 0
38: (L) 0
39: Right-Rotate(T, P)
40: else Left-Right case
41:
y
(Right(L))
42: if
y
= 1 then
43: (L) 1
44: else
45: (L) 0
46: (Right(L)) 0
47: if
y
= 1 then
48: (P) 1
49: else
50: (P) 0
51: Left-Rotate(T, L)
52: Right-Rotate(T, P)
53: break
54: return T
Here we reuse the rotation algorithms mentioned in red-black tree chapter.
Rotation operation doesnt update balancing factor at all, However, since
rotation changes (actually improves) the balance situation we should update
4.5. IMPERATIVE AVL TREE ALGORITHM 97
these factors. Here we refer the results from above section. Among the four
cases, right-right case and left-left case only need one rotation, while right-left
case and left-right case need two rotations.
The relative python program is shown as the following.
def avl_insert_fix(t, x):
while x.parent is not None:
d2 = d1 = x.parent.delta
if x == x.parent.left:
d2 = d2 - 1
else:
d2 = d2 + 1
x.parent.delta = d2
(p, l, r) = (x.parent, x.parent.left, x.parent.right)
if abs(d1) == 1 and abs(d2) == 0:
return t
elif abs(d1) == 0 and abs(d2) == 1:
x = x.parent
elif abs(d1)==1 and abs(d2) == 2:
if d2 == 2:
if r.delta == 1: # Right-right case
p.delta = 0
r.delta = 0
t = left_rotate(t, p)
if r.delta == -1: # Right-Left case
dy = r.left.delta
if dy == 1:
p.delta = -1
else:
p.delta = 0
r.left.delta = 0
if dy == -1:
r.delta = 1
else:
r.delta = 0
t = right_rotate(t, r)
t = left_rotate(t, p)
if d2 == -2:
if l.delta == -1: # Left-left case
p.delta = 0
l.delta = 0
t = right_rotate(t, p)
if l.delta == 1: # Left-right case
dy = l.right.delta
if dy == 1:
l.delta = -1
else:
l.delta = 0
l.right.delta = 0
if dy == -1:
p.delta = 1
else:
p.delta = 0
t = left_rotate(t, l)
98 CHAPTER 4. AVL TREE
t = right_rotate(t, p)
break
return t
We skip the AVL tree deletion algorithm and left this as an exercise to the
reader.
4.6 Chapter note
AVL tree was invented in 1962 by Adelson-Velskii and Landis[3], [4]. The name
AVL tree comes from the two inventors name. Its earlier than red-black tree.
Its very common to compare AVL tree and red-black tree, both are self-
balancing binary search trees, and for all the major operations, they both con-
sume O(lg N) time. From the result of (4.7), AVL tree is more rigidly balanced
hence they are faster than red-black tree in looking up intensive applications
[3]. However, red-black trees could perform better in frequently insertion and
removal cases.
Many popular self-balancing binary search tree libraries are implemented on
top of red-black tree such as STL etc. However, AVL tree provides an intuitive
and eective solution to the balance problem as well.
After this chapter, well extend the tree data structure from storing data in
node to storing information on edges, which leads to Trie and Patrica, etc. If
we extend the number of children from two to more, we can get B-tree. These
data structures will be introduced next.
Bibliography
[1] Data.Tree.AVL http://hackage.haskell.org/packages/archive/AvlTree/4.2/doc/html/Data-
Tree-AVL.html
[2] Chris Okasaki. FUNCTIONAL PEARLS Red-Black Trees in a Functional
Setting. J. Functional Programming. 1998
[3] Wikipedia. AVL tree. http://en.wikipedia.org/wiki/AVL tree
[4] Guy Cousinear, Michel Mauny. The Functional Approach to Program-
ming. Cambridge University Press; English Ed edition (October 29, 1998).
ISBN-13: 978-0521576819
[5] Pavel Grafov. Implementation of an AVL tree in Python.
http://github.com/pgrafov/python-avl-tree
Trie and Patricia with Functional and imperative implementation
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
99
100 Trie and Patricia
Chapter 5
Trie and Patricia with
Functional and imperative
implementation
5.1 abstract
Trie and Patricia are important data structures in information retrieving and
manipulating. None of these data structures are new. They were invented in
1960s. This post collects some existing knowledge about them. Some functional
and imperative implementation are given in order to show the basic idea of these
data structures. There are multiple programming languages used, including,
C++, Haskell, python and scheme/lisp. C++ and python are mostly used to
show the imperative implementation, while Haskell and Scheme are used for
functional purpose.
There may be mistakes in the post, please feel free to point out.
This post is generated by L
A
T
E
X2
(LEFT(T), x/2)
8: else
9: return INT TRIE LOOKUP
(RIGHT(T), x/2)
Look up implemented in Haskell
In Haskell, we can use pattern matching to realize the above long if-then-else
statements. The program is as the following.
search :: IntTrie a Key Maybe a
search Empty k = Nothing
search t 0 = value t
search t k = if even k then search (left t) (k div 2)
else search (right t) (k div 2)
If trie is empty, we simply returns nothing; if key is zero we return the value
of current node; in other case we recursively search either left child or right child
according to the LSB is 0 or not.
To test this program, we can write a smoke test case as following.
testIntTrie = "t=" ++ (toString t) ++
"nsearch t 4: " ++ (show $ search t 4) ++
"nsearch t 0: " ++ (show $ search t 0)
where
t = fromList [(1, a), (4, b), (5, c), (9, d)]
main = do
putStrLn testIntTrie
This program will output these result.
114CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
t=(((. 0 (. 4:b .)) 0 .) 0 (((. 1 (. 9:d .)) 1 (. 5:c .)) 1:a .))
search t 4: Just b
search t 0: Nothing
Look up implemented in Scheme/Lisp
Scheme/Lisp implementation is quite similar. Note that we decrees key by 1
before divide it with 2.
(define (lookup t k)
(if (null? t) ()
(if (= k 0) (value t)
(if (even? k)
(lookup (left t) (/ k 2))
(lookup (right t) (/ (- k 1) 2))))))
Test cases use the same Trie which is created in insertion section.
(define (test-int-trie)
(define t (listtrie (list (1 "a") (4 "b") (5 "c") (9 "d"))))
(display (triestring t)) (newline)
(display "lookup4:") (display (lookup t 4)) (newline)
(display "lookup0:") (display (lookup t 0)) (newline))
the result is as same as the one output by Haskell program.
(test-int-trie)
(((. 0. (. 4b )) 0. ) 0. (((. 1. (. 9d )) 1. (. 5c )) 1a ))
lookup 4: b
lookup 0: ()
5.4 Integer Patricia Tree
Its very easy to nd the drawbacks of integer binary trie. Trie wasts a lot of
spaces. Note in gure 5.3, all nodes except leafs store the real data. Typically,
an integer binary trie contains many nodes only have one child. It is very easy
to come to the idea for improvement, to compress the chained nodes which
have only one child. Patricia is such a data structure invented by Donald R.
Morrison in 1968. Patricia means practical algorithm to retrieve information
coded in alphanumeric[3]. Wikipedia redirect Patricia as Radix tree.
Chris Okasaki gave his implementation of Integer Patricia tree in paper [2].
If we merge the chained nodes which have only one child together in gure 5.3,
We can get a patricia as shown in gure 5.4.
From this gure, we can found that the keys of sibling nodes have the longest
common prex. They only branches out at certain bit. It means that we can
save a lot of data by storing the common prex.
Dierent from integer Trie, using big-endian integer in Patricia doesnt cause
the problem mentioned in section 5.3. Because all zero bits before MSB can be
just omitted to save space. Big-endian integer is more natural than little-endian
integer. Chris Okasaki list some signicant advantages of big-endian Patricia
trees [2].
5.4. INTEGER PATRICIA TREE 115
4:b
001
1:a
1
0
9:d
01
5:c
1
Figure 5.4: Little endian patricia for the map {1 a, 4 b, 5 c, 9 d}.
5.4.1 Denition of Integer Patricia tree
Integer Patricia tree is a special kind of binary tree, it is
either a leaf node contains an integer key and a value
or a branch node, contains a left child and a right child. The integer keys
of two children shares the longest common prex bits, the next bit of the
left childs key is zero while it is one for right childs key.
Denition of big-endian integer Patricia tree in Haskell
If we translate the above recursive denition to Haskell, we can get below Integer
Patrica Tree code.
data IntTree a = Empty
| Leaf Key a
| Branch Prefix Mask (IntTree a) (IntTree a) -- prefix, mask, left, right
type Key = Int
type Prefix = Int
type Mask = Int
In order to tell from which bit the left and right children dier, a mask is
recorded by the branch node. Typically, a mask is 2
n
, all lower bits than n
doesnt belong to common prex
Denition of big-endian integer Patricia tree in Python
Such denition can be represent in Python similarly. Some helper functions are
provided for easy operation later on.
class IntTree:
def __init__(self, key = None, value = None):
self.key = key
self.value = value
self.prefix = self.mask = None
116CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
self.left = self.right = None
def set_children(self, l, r):
self.left = l
self.right = r
def replace_child(self, x, y):
if self.left == x:
self.left = y
else:
self.right = y
def is_leaf(self):
return self.left is None and self.right is None
def get_prefix(self):
if self.prefix is None:
return self.key
else:
return self.prefix
Some helper member functions are provided in this denition. When Initial-
ized, prex, mask and children are all set to invalid value. Note the get prex()
function, in case the prex hasnt been initialized, which means it is a leaf node,
the key itself is returned.
Denition of big-endian integer Patricia tree in C++
With ISO C++, the type of the data stored in Patricia can be abstracted as a
template parameter. The denition is similar to the python version.
template<class T>
struct IntPatricia{
IntPatricia(int k=0, T v=T()):
key(k), value(v), prefix(k), mask(1), left(0), right(0){}
~IntPatricia(){
delete left;
delete right;
}
bool is_leaf(){
return left == 0 && right == 0;
}
bool match(int x){
return (!is_leaf()) && (maskbit(x, mask) == prefix);
}
void replace_child(IntPatricia<T> x, IntPatricia<T> y){
if(left == x)
left = y;
else
right = y;
}
5.4. INTEGER PATRICIA TREE 117
void set_children(IntPatricia<T> l, IntPatricia<T> r){
left = l;
right = r;
}
int key;
T value;
int prefix;
int mask;
IntPatricia left;
IntPatricia right;
};
In order to release the memory easily, the program just recursively deletes
the children in destructor. The default value of type T is used for initialization.
The prex is initialized to be the same value as key.
For the member function match(), Ill explain it in later part.
Denition of big-endian integer Patricia tree in Scheme/Lisp
In Scheme/Lisp program, the data structure behind is still list, we provide
creator functions and accessors to create Patricia and access the children, key,
value, prex and mask.
(define (make-leaf k v) ;; key and value
(list k v))
(define (make-branch p m l r) ;; prefix, mask, left and right
(list p m l r))
;; Helpers
(define (leaf? t)
(= (length t) 2))
(define (branch? t)
(= (length t) 4))
(define (key t)
(if (leaf? t) (car t) ()))
(define (value t)
(if (leaf? t) (cadr t) ()))
(define (prefix t)
(if (branch? t) (car t) ()))
(define (mask t)
(if (branch? t) (cadr t) ()))
(define (left t)
(if (branch? t) (caddr t) ()))
(define (right t)
(if (branch? t) (cadddr t) ()))
118CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Function key and value are only applicable to leaf node while prex, mask,
children accessors are only applicable to branch node. So we test the node type
in these functions.
5.4.2 Insertion of Integer Patricia tree
When insert a key into a integer Patricia tree, if the tree is empty, we can just
create a leaf node with the given key and data. (as shown in gure 5.5).
NIL 12
Figure 5.5: (a). Insert key 12 to an empty patricia tree.
If the tree only contains a leaf node x, we can create a branch, put the new
key and data as a leaf y of the branch. To determine if the new leaf y should
be left node or right node. We need nd the longest common prex of x and y,
for example if key(x) is 12 (1100 in binary), key(y) is 15 (1111 in binary), then
the longest common prex is 11oo. The o denotes the bits we dont care about.
we can use an integer to mask the those bits. In this case, the mask number
is 4 (100 in binary). The next bit after the prex presents 2
1
. Its 0 in key(x),
while it is 1 in key(y). So we put x as left child and y as right child. Figure 5.6
shows this case.
12
prefix=1100
mask=100
12
0
15
1
Figure 5.6: (b). Insert key 15 to the result tree in (a).
If the tree is neither empty, nor a leaf node, we need rstly check if the key
to be inserted matches common prex with root node. If it does, then we can
recursively insert the key to the left child or right child according to the next
bit. For instance, if we want to insert key 14 (1110 in binary) to the result tree
in gure 5.6, since it has common prex 11oo, and the next bit (the bit of 2
1
) is
5.4. INTEGER PATRICIA TREE 119
1, so we tried to insert 14 to the right child. Otherwise, if the key to be inserted
doesnt match the common prex with the root node, we need branch a new
leaf node. Figure 5.7 shows these 2 dierent cases.
prefix=1100
mask=100
12
0
15
1
prefix=1100
mask=100
12
0
prefix=1110
mask=10
1
14
0
15
1
prefix=1100
mask=100
12
0
15
1
prefix=0
mask=10000
prefix=1110
mask=10
1
5
0
12
0
15
1
Figure 5.7: (c). Insert key 14 to the result tree in (b); (d). Insert key 5 to the
result tree in (b).
Iterative insertion algorithm for integer Patricia
Summarize the above cases, the insertion of integer patricia can be described
with the following algorithm.
1: function INT-PATRICIA-INSERT(T, x, data)
2: if T = NIL then
3: T CREATE LEAF(x, data)
4: return T
5: y T
6: p NIL
7: while y is not LEAF and MATCH(x, PREFIX(y), MASK(y)) do
8: p y
9: if ZERO(x, MASK(y)) = TRUE then
10: y LEFT(y)
11: else
12: y RIGHT(y)
13: if LEAF(y) = TRUE and x = KEY (y) then
14: DATA(y) data
15: else
16: z BRANCH(y, CREATE LEAF(x, data))
17: if p = NIL then
18: T z
19: else
20: if LEFT(p) = y then
21: LEFT(p) z
22: else
23: RIGHT(p) z
24: return T
120CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
In the above algorithm, MATCH procedure test if an integer key x, has the
same prex of node y above the mask bit. For instance, Suppose the prex
of node y can be denoted as p(n), p(n 1), ..., p(i), ..., p(0) in binary, key x is
k(n), k(n 1), ..., k(i), ..., k(0), and mask of node y is 100...0 = 2
i
, if and only if
p(j) = k(j) for all i j n, we say the key matches.
Insertion of big-endian integer Patricia tree in Python
Based on the above algorithm, the main insertion program can be realized as
the following.
def insert(t, key, value = None):
if t is None:
t = IntTree(key, value)
return t
node = t
parent = None
while(True):
if match(key, node):
parent = node
if zero(key, node.mask):
node = node.left
else:
node = node.right
else:
if node.is_leaf() and key == node.key:
node.value = value
else:
new_node = branch(node, IntTree(key, value))
if parent is None:
t = new_node
else:
parent.replace_child(node, new_node)
break
return t
The sub procedure of match, branch, lcp etc. are given as below.
def maskbit(x, mask):
return x & (~(mask-1))
def match(key, tree):
if tree.is_leaf():
return False
return maskbit(key, tree.mask) == tree.prefix
def zero(x, mask):
return x & (mask>>1) == 0
def lcp(p1, p2):
diff = (p1 ^ p2)
mask=1
while(diff!=0):
diff>>=1
5.4. INTEGER PATRICIA TREE 121
mask<1
return (maskbit(p1, mask), mask)
def branch(t1, t2):
t = IntTree()
(t.prefix, t.mask) = lcp(t1.get_prefix(), t2.get_prefix())
if zero(t1.get_prefix(), t.mask):
t.set_children(t1, t2)
else:
t.set_children(t2, t1)
return t
Function maskbit() can clear all bits covered by a mask to 0. For instance,
x = 101101(b), and mask = 2
3
= 100(b), the lowest 2 bits will be cleared to
0, which means maskbit(x, mask) = 101100(b). This can be easily done by
bit-wise operation.
Function zero() is used to check if the bit next to mask bit is 0. For instance,
if x = 101101(b), y = 101111(b), and mask = 2
3
= 100(b), zero will check if the
2nd lowest bit is 0. So zero(x, mask) = true, zero(y, mask) = false.
Function lcp can extract the Longest Common Prex of two integer. For
the x and y in above example, because only the last 2 bits are dierent, so
lcp(x, y) = 101100(b). And we set a mask to 2
3
= 100(b) to indicate that the
last 2 bits are not eective for the prex value.
To convert a list or a map into a Patricia tree, we can repeatedly insert the
elements one by one. Since the program is same, except for the insert function,
we can abstract list to xxx and map to xxx to utility functions
# in trieutil.py
def from_list(l, insert_func):
t = None
for x in l:
t = insert_func(t, x)
return t
def from_map(m, insert_func):
t = None
for k, v in m.items():
t = insert_func(t, k, v)
return t
With this high level functions, we can provide list to patricia and map to patricia
as below.
def list_to_patricia(l):
return from_list(l, insert)
def map_to_patricia(m):
return from_map(m, insert)
In order to have smoke test of the above insertion program, some test cases
and output helper are given.
def to_string(t):
to_str = lambda x: "%s" %x
if t is None:
122CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
return ""
if t.is_leaf():
str = to_str(t.key)
if t.value is not None:
str += ":"+to_str(t.value)
return str
str ="["+to_str(t.prefix)+"@"+to_str(t.mask)+"]"
str+="("+to_string(t.left)+","+to_string(t.right)+")"
return str
class IntTreeTest:
def run(self):
self.test_insert()
def test_insert(self):
print "testinsert"
t = list_to_patricia([6])
print to_string(t)
t = list_to_patricia([6, 7])
print to_string(t)
t = map_to_patricia({1:x, 4:y, 5:z})
print to_string(t)
if __name__ == "__main__":
IntTreeTest().run()
The program will output a result as the following.
test insert
6
[6@2](6,7)
[0@8](1:x,[4@2](4:y,5:z))
This result means the program creates a Patrica tree shown in Figure 5.8.
Insertion of big-endian integer Patricia tree in C++
In the below C++ program, the default value of data type is used if user doesnt
provide data. It is nearly strict translation of the pseudo code.
template<class T>
IntPatricia<T> insert(IntPatricia<T> t, int key, T value=T()){
if(!t)
return new IntPatricia<T>(key, value);
IntPatricia<T> node = t;
IntPatricia<T> parent(0);
while( nodeis_leaf()==false && nodematch(key) ){
parent = node;
if(zero(key, nodemask))
node = nodeleft;
else
node = noderight;
}
5.4. INTEGER PATRICIA TREE 123
prefix=0
mask=8
1:x
0
prefix=100
mask=2
1
4:y
0
5:z
1
Figure 5.8: Insert map 1 x, 4 y, 5 z into a big-endian integer Patricia
tree.
124CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
if(nodeis_leaf() && key == nodekey)
nodevalue = value;
else{
IntPatricia<T> p = branch(node, new IntPatricia<T>(key, value));
if(!parent)
return p;
parentreplace_child(node, p);
}
return t;
}
Lets review the implementation of member function match()
bool match(int x){
return (!is_leaf()) && (maskbit(x, mask) == prefix);
}
if a node is not a leaf, and it has common prex (in bit-wise) as the key to be
inserted, we say the node match the key. It is realized by a maskbit() function
as below.
int maskbit(int x, int mask){
return x & (~(mask-1));
}
Since mask is always 2
n
, minus 1 will ip it to 111...1(b), then we reverse
the it by bit-wise not, and clear all the lowest n 1 bits of x by bit-wise and.
The branch() function in above program is as the following.
template<class T>
IntPatricia<T> branch(IntPatricia<T> t1, IntPatricia<T> t2){
IntPatricia<T> t = new IntPatricia<T>();
tmask = lcp(tprefix, t1prefix, t2prefix);
if(zero(t1prefix, tmask))
tset_children(t1, t2);
else
tset_children(t2, t1);
return t;
}
It will extract the Longest Common Prex, and create a new node, put the
2 nodes to be merged as its children. Function lcp() is implemented as below.
int lcp(int& p, int p1, int p2){
int diff = p1^p2;
int mask = 1;
while(diff){
diff>>=1;
mask<1;
}
p = maskbit(p1, mask);
return mask;
}
Because we can only return one value in C++, we set the reference of pa-
rameter p as the common prex result and returns the mask value.
5.4. INTEGER PATRICIA TREE 125
To decide which child is left and which one is right when branching, we need
test if the bit next to mask bit is zero.
bool zero(int x, int mask){
return (x & (mask>>1)) == 0;
}
To verify the C++ program, some simple test cases are provided.
IntPatricia<int> ti(0);
const int lst[] = {6, 7};
ti = std::accumulate(lst, lst+sizeof(lst)/sizeof(int), ti,
std::ptr_fun(insert_key<int>));
std::copy(lst, lst+sizeof(lst)/sizeof(int),
std::ostream_iterator<int>(std::cout, ","));
std::cout<<"="<<patricia_to_str(ti)<<"n";
const int keys[] = {1, 4, 5};
const char vals[] = "xyz";
IntPatricia<char> tc(0);
for(unsigned int i=0; i<sizeof(keys)/sizeof(int); ++i)
tc = insert(tc, keys[i], vals[i]);
std::copy(keys, keys+sizeof(keys)/sizeof(int),
std::ostream_iterator<int>(std::cout, ","));
std::cout<<"="<<patricia_to_str(tc);
To avoid repeating ourselves, we provide a dierent way instead of write a
list to patrica(), which is very similar to list to trie in previous section.
In C++ STL, std::accumulate() plays a similar role of fold-left. But the
functor we provide to accumulate must take 2 parameters, so we provide a
wrapper function as below.
template<class T>
IntPatricia<T> insert_key(IntPatricia<T> t, int key){
return insert(t, key);
}
With all these code line, we can get the following result.
6, 7, ==>[6@2](6,7)
1, 4, 5, ==>[0@8](1:x,[4@2](4:y,5:z))
Recursive insertion algorithm for integer Patricia
To implement insertion in recursive way, we treat the dierent cases separately.
If the tree is empty, we just create a leaf node and return; if the tree is a leaf
node, we need check the key of the node is as same as the key to be inserted,
we overwrite the data in case they are same, else we need branch a new node
and extract the longest common prex and mask bit; In other case, we need
examine if the key as common prex with the branch node, and recursively
perform insertion either to left child or to right child according to the next
dierent bit is 0 or 1; Below recursive algorithm describes this approach.
1: function INT-PATRICIA-INSERT(T, x, data)
2: if T = NIL or (T is a leaf and x = KEY (T)) then
3: return CREATE LEAF(x, data)
126CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
4: else if MATCH(x, PREFIX(T), MASK(T)) then
5: if ZERO(x, MASK(T)) then
6: LEFT(T) INTPATRICIAINSERT
(LEFT(T), x, data)
7: else
8: RIGHT(T) INTPATRICIAINSERT
(RIGHT(T), x, data)
9: return T
10: else
11: return BRANCH(T, CREATE LEAF(x, data))
Insertion of big-endian integer Patricia tree in Haskell
Insertion of big-endian integer Patricia tree can be implemented in Haskell by
Change the above algorithm to recursive approach.
-- usage: insert tree key x
insert :: IntTree a Key a IntTree a
insert t k x
= case t of
Empty Leaf k x
Leaf k x if k==k then Leaf k x
else join k (Leaf k x) k t -- t@(Leaf k x)
Branch p m l r
| match k p m if zero k m
then Branch p m (insert l k x) r
else Branch p m l (insert r k x)
| otherwise join k (Leaf k x) p t -- t@(Branch p m l r)
The match, zero and join functions in this program are dened as below.
-- join 2 nodes together.
-- (prefix1, tree1) ++ (prefix2, tree2)
-- 1. find the longest common prefix == lcp(prefix1, prefix2), where
-- prefix1 = a(n),a(n-1),...a(i+1),a(i),x...
-- prefix2 = a(n),a(n-1),...a(i+1),a(i),y...
-- prefix = a(n),a(n-1),...a(i+1),a(i),00...0
-- 2. mask bit = 100...0b (=2^i)
-- so mask is something like, 1,2,4,...,128,256,...
-- 3. if x==0, y==1 then (tree1left, tree2right),
-- else if x==1, y==0 then (tree2left, tree1right).
join :: Prefix IntTree a Prefix IntTree a IntTree a
join p1 t1 p2 t2 = if zero p1 m then Branch p m t1 t2
else Branch p m t2 t1
where
(p, m) = lcp p1 p2
-- lcp means longest common prefix
lcp :: Prefix Prefix (Prefix, Mask)
lcp p1 p2 = (p, m) where
m = bit (highestBit (p1 xor p2))
p = mask p1 m
-- get the order of highest bit of 1.
-- For a number x = 00...0,1,a(i-1)...a(1)
-- the result is i
5.4. INTEGER PATRICIA TREE 127
highestBit :: Int Int
highestBit x = if x==0 then 0 else 1+highestBit (shiftR x 1)
-- For a number x = a(n),a(n-1)...a(i),a(i-1),...,a(0)
-- and a mask m = 100..0 (=2^i)
-- the result of mask x m is a(n),a(n-1)...a(i),00..0
mask :: Int Mask Int
mask x m = (x &. complement (m-1)) -- complement means bit-wise not.
-- Test if the next bit after mask bit is zero
-- For a number x = a(n),a(n-1)...a(i),1,...a(0)
-- and a mask m = 100..0 (=2^i)
-- because the bit next to a(i) is 1, so the result is False
-- For a number y = a(n),a(n-1)...a(i),0,...a(0) the result is True.
zero :: Int Mask Bool
zero x m = x &. (shiftR m 1) == 0
-- Test if a key matches a prefix above of the mask bit
-- For a prefix: p(n),p(n-1)...p(i)...p(0)
-- a key: k(n),k(n-1)...k(i)...k(0)
-- and a mask: 100..0 = (2^i)
-- If and only if p(j)==k(j), ijn the result is True
match :: Key Prefix Mask Bool
match k p m = (mask k m) == p
In order to test the above insertion program, some test helper functions are
provided.
-- Generate a Int Patricia tree from a list
-- Usage: fromList [(k1, x1), (k2, x2),..., (kn, xn)]
fromList :: [(Key, a)] IntTree a
fromList xs = foldl ins Empty xs where
ins t (k, v) = insert t k v
toString :: (Show a)IntTree a String
toString t =
case t of
Empty "."
Leaf k x (show k) ++ ":" ++ (show x)
Branch p m l r "[" ++ (show p) ++ "@" ++ (show m) ++ "]" ++
"(" ++ (toString l) ++ ", " ++ (toString r) ++ ")"
With these helpers, insertion can be test as the following.
testIntTree = "t=" ++ (toString t)
where
t = fromList [(1, x), (4, y), (5, z)]
main = do
putStrLn testIntTree
This test will output:
t=[0@8](1:x, [4@2](4:y, 5:z))
This result means the program creates a Patrica tree shown in Figure 5.8.
128CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Insertion of big-endian integer Patricia tree in Scheme/Lisp
In Scheme/Lisp, we use switch-case like condition to test if the node is empty,
or a leaf or a branch.
(define (insert t k x) ;; t: patrica, k: key, x: value
(cond ((null? t) (make-leaf k x))
((leaf? t) (if (= (key t) k)
(make-leaf k x) ;; overwrite
(branch k (make-leaf k x) (key t) t)))
((branch? t) (if (match? k (prefix t) (mask t))
(if (zero-bit? k (mask t))
(make-branch (prefix t)
(mask t)
(insert (left t) k x)
(right t))
(make-branch (prefix t)
(mask t)
(left t)
(insert (right t) k x)))
(branch k (make-leaf k x) (prefix t) t)))))
Where the function match?, zero-bit?, and branch are given as the following.
We use the scheme x number bit-wise operations to mask the number and test
bit.
(define (mask-bit x m)
(fix:and x (fix:not (- m 1))))
(define (zero-bit? x m)
(= (fix:and x (fix:lsh m -1)) 0))
(define (lcp x y) ;; get the longest common prefix
(define (count-mask z)
(if (= z 0) 1 ( 2 (count-mask (fix:lsh z -1)))))
(let ((m (count-mask (fix:xor x y)))
(p (mask-bit x m)))
(cons p m)))
(define (match? k p m)
(= (mask-bit k m) p))
(define (branch p1 t1 p2 t2) ;; pi: prefix i, ti: Patricia i
(let ((pm (lcp p1 p2))
(p (car pm))
(m (cdr pm)))
(if (zero-bit? p1 m)
(make-branch p m t1 t2)
(make-branch p m t2 t1))))
We can use the very same list-trie function which is dened in integer trie.
Below is an example to create a integer Patricia tree.
(define (test-int-patricia)
(define t (listtrie (list (1 "x") (4 "y") (5 "z"))))
(display t) (newline))
5.4. INTEGER PATRICIA TREE 129
Evaluate it will generate a Patricia tree like below.
(test-int-patricia)
(0 8 (1 x) (4 2 (4 y) (5 z)))
It is identical to the insert result output by Hasekll insertion program.
5.4.3 Look up in Integer Patricia tree
Consider the property of integer Patricia tree, to look up a key, we test if the
key has common prex with the root, if yes, we then check the next bit diers
from common prex is zero or one. If it is zero, we then do look up in the left
child, else we turn to right.
Iterative looking up in integer Patricia tree
In case we reach a leaf node, we can directly check if the key of the leaf is equal
to what we are looking up. This algorithm can be described with the following
pseudo code.
1: function INT-PATRICIA-LOOK-UP(T, x)
2: if T = NIL then
3: return NIL
4: while T is not LEAF and MATCH(x, PREFIX(T), MASK(T)) do
5: if ZERO(x, MASK(T)) then
6: T LEFT(T)
7: else
8: T RIGHT(T)
9: if T is LEAF and KEY (T) = x then
10: return DATA(T)
11: else
12: return NIL
Look up in big-endian integer Patricia tree in Python
With Python, we can directly translate the pseudo code into valid program.
def lookup(t, key):
if t is None:
return None
while (not t.is_leaf()) and match(key, t):
if zero(key, t.mask):
t = t.left
else:
t = t.right
if t.is_leaf() and t.key == key:
return t.value
else:
return None
We can verify this program by some simple smoke test cases.
print "testlookup"
t = map_to_patricia({1:x, 4:y, 5:z})
print "lookup4:", lookup(t, 4)
print "lookup0:", lookup(t, 0)
130CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
We can get similar output as below.
test look up
look up 4: y
look up 0: None
Look up in big-endian integer Patricia tree in C++
With C++ language, if the program doesnt nd the key, we can either raise
exception to indicate a search failure or return a special value.
template<class T>
T lookup(IntPatricia<T> t, int key){
if(!t)
return T(); //or throw exception
while( (!tis_leaf()) && tmatch(key)){
if(zero(key, tmask))
t = tleft;
else
t = tright;
}
if(tis_leaf() && tkey == key)
return tvalue;
else
return T(); //or throw exception
}
We can try some test cases to search keys in a Patricia tree we created when
test insertion.
std::cout<<"nlookup4:"<<lookup(tc, 4)
<<"nlookup0:"<<lookup(tc, 0)<<"n";
The output result is as the following.
look up 4: y
look up 0:
Recursive looking up in integer Patricia tree
We can easily change the while-loop in above iterative algorithm into recursive
calls, so that we can have a functional approach.
1: function INT-PATRICIA-LOOK-UP(T, x)
2: if T = NIL then
3: return NIL
4: else if T is a leaf and x = KEY (T) then
5: return V ALUE(T)
6: else if MATCH(x, PREFIX(T), MASK(T)) then
7: if ZERO(x, MASK(T)) then
8: return INT PATRICIALOOK UP
(LEFT(T), x)
9: else
10: return INT PATRICIALOOK UP
(RIGHT(T), x)
11: else
12: return NIL
5.4. INTEGER PATRICIA TREE 131
Look up in big-endian integer Patricia tree in Haskell
By changing the above if-then-else into pattern matching, we can get Haskell
version of looking up program.
-- look up a key
search :: IntTree a Key Maybe a
search t k
= case t of
Empty Nothing
Leaf k x if k==k then Just x else Nothing
Branch p m l r
| match k p m if zero k m then search l k
else search r k
| otherwise Nothing
And we can test this program with looking up some keys in the previously
created Patricia tree.
testIntTree = "t=" ++ (toString t) ++ "nsearch t 4: " ++ (show $ search t 4) ++
"nsearch t 0: " ++ (show $ search t 0)
where
t = fromList [(1, x), (4, y), (5, z)]
main = do
putStrLn testIntTree
The output result is as the following.
t=[0@8](1:x, [4@2](4:y, 5:z))
search t 4: Just y
search t 0: Nothing
Look up in big-endian integer Patricia tree in Scheme/Lisp
Scheme/Lisp program for looking up is similar in case the tree is empty, we just
returns nothing; If it is a leaf node and the key is equal to the number we are
looking for, we nd the result; If it is branch, we need test if binary format of
the prex matches the number, then we recursively search either in left child or
in right child according to the next bit after mask is zero or not.
(define (lookup t k)
(cond ((null? t) ())
((leaf? t) (if (= (key t) k) (value t) ()))
((branch? t) (if (match? k (prefix t) (mask t))
(if (zero-bit? k (mask t))
(lookup (left t) k)
(lookup (right t) k))
()))))
We can test it with the Patricia tree we create in the insertion program.
(define (test-int-patricia)
(define t (listtrie (list (1 "x") (4 "y") (5 "z"))))
(display t) (newline)
(display "lookup4:") (display (lookup t 4)) (newline)
(display "lookup0:") (display (lookup t 0)) (newline))
132CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
The result is like below.
(test-int-patricia)
(0 8 (1 x) (4 2 (4 y) (5 z)))
lookup 4: y
lookup 0: ()
5.5 Alphabetic Trie
Integer based Trie and Patricia Tree can be a good start point. Such tech-
nical plays important role in Compiler implementation. Okasaki pointed that
the widely used Haskell Compiler GHC (Glasgow Haskell Compiler), utilizes a
similar implementation for several years before 1998 [2].
While if we extend the type of the key from integer to alphabetic value,
Trie and Patricia tree can be very useful in textual manipulation engineering
problems.
5.5.1 Denition of alphabetic Trie
If the key is alphabetic value, just left and right children cant represent all
values. For English, there are 26 letters and each can be lower case or upper
case. If we dont care about case, one solution is to limit the number of branches
(children) to 26. Some simplied ANSI C implementation of Trie are dened
by using an array of 26 letters. This can be illustrated as in Figure 5.9.
In each node, not all branches may contain data. for instance, in the above
gure, the root node only has its branches represent letter a, b, and z have sub
trees. Other branches such as for letter c, is empty. For other nodes, empty
branches (point to nil) are not shown.
Ill give such simplied implementation in ANSI C in later section, however,
before we go to the detailed source code, lets consider some alternatives.
In case of language other than English, there may be more letters than 26,
and if we need solve case sensitive problem. we face a problem of dynamic size
of sub branches. There are 2 typical method to represent children, one is by
using Hash table, the other is by using map. Well show these two types of
method in Python and C++.
Denition of alphabetic Trie in ANSI C
ANSI C implementation is to illustrate a simplied approach limited only to
case-insensitive English language. The program cant deal with letters other
than lower case a to z such as digits, space, tab etc.
struct Trie{
struct Trie children[26];
void data;
};
In order to initialize/destroy the children and data, I also provide 2 helper
functions.
struct Trie create_node(){
struct Trie t = (struct Trie)malloc(sizeof(struct Trie));
5.5. ALPHABETIC TRIE 133
a
a b
nil
c
...
z
an
n
o
t
h
e
another
r
o
o
boy
y
bool
l
o
zoo
o
Figure 5.9: A Trie with 26 branches, with key a, an, another, bool, boy and zoo
inserted.
134CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
int i;
for(i=0; i<26; ++i)
tchildren[i]=0;
tdata=0;
return t;
}
void destroy(struct Trie t){
if(!t)
return;
int i;
for(i=0; i<26; ++i)
destroy(tchildren[i]);
if(tdata)
free(tdata);
free(t);
}
Note that, the destroy function uses recursive approach to free all children
nodes.
Denition of alphabetic Trie in C++
With C++ and STL, we can abstract the language and characters as type
parameter. Since the number of characters of the undetermined language varies,
we can use std::map to store children of a node.
template<class Char, class Value>
struct Trie{
typedef Trie<Char, Value> Self;
typedef std::map<Char, Self> Children;
typedef Value ValueType;
Trie():value(Value()){}
virtual ~Trie(){
for(typename Children::iterator it=children.begin();
it!=children.end(); ++it)
delete itsecond;
}
Value value;
Children children;
};
For simple illustration purpose, recursive destructor is used to release the
memory.
Denition of alphabetic Trie in Haskell
We can use Haskell record syntax to get some free accessor functions[4].
data Trie a = Trie { value :: Maybe a
5.5. ALPHABETIC TRIE 135
, children :: [(Char, Trie a)]}
empty = Trie Nothing []
Neither Map nor Hash table is used, just a list of pairs can realize the
same purpose. Function empty can help to create an empty Trie node. This
implementation doesnt constrain the key values to lower case English letters,
it can actually contains any values of Char type.
Denition of alphabetic Trie in Python
In Python version, we can use Hash table as the data structure to represent
children nodes.
class Trie:
def __init__(self):
self.value = None
self.children = {}
Denition of alphabetic Trie in Scheme/Lisp
The denition of alphabetic Trie in Scheme/Lisp is a list of two elements, one
is the value of the node, the other is a children list. The children list is a list of
pairs, one is the character binding to the child, the other is a Trie node.
(define (make-trie v lop) ;; v: value, lop: children, list of char-trie pairs
(cons v lop))
(define (value t)
(if (null? t) () (car t)))
(define (children t)
(if (null? t) () (cdr t)))
In order to create the child and access it easily, we also provide functions for
such purpose.
(define (make-child k t)
(cons k t))
(define (key child)
(if (null? child) () (car child)))
(define (tree child)
(if (null? child) () (cdr child)))
5.5.2 Insertion of alphabetic trie
To insert a key with type of string into a Trie, we pick the rst letter from the
key string. Then check from the root node, we examine which branch among
the children represents this letter. If the branch is null, we then create an empty
node. After that, we pick the next letter from the key string and pick the proper
branch from the grand children of the root.
136CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
We repeat the above process till nishing all the letters of the key. At this
time point, we can nally set the data to be inserted as the value of the node.
Note that the value of root node of Trie is always empty.
Iterative algorithm of trie insertion
The below pseudo code describes the above insertion algorithm.
1: function TRIE-INSERT(T, key, data)
2: if T = NIL then
3: T EmptyNode
4: p = T
5: for each c in key do
6: if CHILDREN(p)[c] = NIL then
7: CHILDREN(p)[c] EmptyNode
8: p CHILDREN(p)[c]
9: DATA(p) data
10: return T
Simplied insertion of alphabetic trie in ANSI C
Go on with the above ANSI C denition, because only lower case English letter
is supported, we can use plan array manipulation to do the insertion.
struct Trie insert(struct Trie t, const char key, void value){
if(!t)
t=create_node();
struct Trie p =t;
while(key){
int c = key - a;
if(!pchildren[c])
pchildren[c] = create_node();
p = pchildren[c];
++key;
}
pdata = value;
return t;
}
In order to test the above program, some helper functions to print content
of the Trie is provided as the following.
void print_trie(struct Trie t, const char prefix){
printf("(%s", prefix);
if(tdata)
printf(":%s", (char)(tdata));
int i;
for(i=0; i<26; ++i){
if(tchildren[i]){
printf(",");
char new_prefix=(char)malloc(strlen(prefix+1)sizeof(char));
sprintf(new_prefix, "%s%c", prefix, i+a);
print_trie(tchildren[i], new_prefix);
5.5. ALPHABETIC TRIE 137
}
}
printf(")");
}
After that, we can test the insertion program with such test cases.
struct Trie test_insert(){
struct Trie t=0;
t = insert(t, "a", 0);
t = insert(t, "an", 0);
t = insert(t, "another", 0);
t = insert(t, "boy", 0);
t = insert(t, "bool", 0);
t = insert(t, "zoo", 0);
print_trie(t, "");
return t;
}
int main(int argc, char argv){
struct Trie t = test_insert();
destroy(t);
return 0;
}
This program will output a Trie like this.
(, (a, (an, (ano, (anot, (anoth, (anothe, (another))))))),
(b, (bo, (boo, (bool)), (boy))), (z, (zo, (zoo))))
It is exactly the Trie as shown in gure 5.9.
Insertion of alphabetic Trie in C++
With above C++ denition, we can utilize STL provided search function in
std::map to locate a child quickly, the program is implemented as the following,
note that if user only provides key for insert, we also insert a default value of
that type.
template<class Char, class Value, class Key>
Trie<Char, Value> insert(Trie<Char, Value> t, Key key, Value value=Value()){
if(!t)
t = new Trie<Char, Value>();
Trie<Char, Value> p(t);
for(typename Key::iterator it=key.begin(); it!=key.end(); ++it){
if(pchildren.find(it) == pchildren.end())
pchildren[it] = new Trie<Char, Value>();
p = pchildren[it];
}
pvalue = value;
return t;
}
template<class T, class K>
T insert_key(T t, K key){
138CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
return insert(t, key);
}
Where insert key() acts as a adapter, well use similar accumulation method
to create trie from list later.
To test this program, we provide the helper functions to print the trie on
console.
template<class T>
std::string trie_to_str(T t, std::string prefix=""){
std::ostringstream s;
s<<"("<<prefix;
if(tvalue != typename T::ValueType())
s<<":"<<tvalue;
for(typename T::Children::iterator it=tchildren.begin();
it!=tchildren.end(); ++it)
s<<","<<trie_to_str(itsecond, prefix+itfirst);
s<<")";
return s.str();
}
After that, we can test our program with some simple test cases.
typedef Trie<char, std::string> TrieType;
TrieType t(0);
const char lst[] = {"a", "an", "another", "b", "bob", "bool", "home"};
t = std::accumulate(lst, lst+sizeof(lst)/sizeof(char), t,
std::ptr_fun(insert_key<TrieType, std::string>));
std::copy(lst, lst+sizeof(lst)/sizeof(char),
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"n="<<trie_to_str(t)<<"n";
delete t;
t=0;
const char keys[] = {"001", "100", "101"};
const char vals[] = {"y", "x", "z"};
for(unsigned int i=0; i<sizeof(keys)/sizeof(char); ++i)
t = insert(t, std::string(keys[i]), std::string(vals[i]));
std::copy(keys, keys+sizeof(keys)/sizeof(char),
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"="<<trie_to_str(t)<<"n";
delete t;
It will output result like this.
a, an, another, b, bob, bool, home,
==>(, (a, (an, (ano, (anot, (anoth, (anothe, (another))))))), (b, (bo,
(bob), (boo, (bool)))), (h, (ho, (hom, (home)))))
001, 100, 101, ==>(, (0, (00, (001:y))), (1, (10, (100:x), (101:z))))
Insertion of alphabetic trie in Python
In python the implementation is very similar to the pseudo code.
def trie_insert(t, key, value = None):
if t is None:
5.5. ALPHABETIC TRIE 139
t = Trie()
p = t
for c in key:
if not c in p.children:
p.children[c] = Trie()
p = p.children[c]
p.value = value
return t
And we dene the helper functions as the following.
def trie_to_str(t, prefix=""):
str="("+prefix
if t.value is not None:
str += ":"+t.value
for k,v in sorted(t.children.items()):
str += ","+trie_to_str(v, prefix+k)
str+=")"
return str
def list_to_trie(l):
return from_list(l, trie_insert)
def map_to_trie(m):
return from_map(m, trie_insert)
With these helpers, we can test the insert program as below.
class TrieTest:
#...
def test_insert(self):
t = None
t = trie_insert(t, "a")
t = trie_insert(t, "an")
t = trie_insert(t, "another")
t = trie_insert(t, "b")
t = trie_insert(t, "bob")
t = trie_insert(t, "bool")
t = trie_insert(t, "home")
print trie_to_str(t)
It will print a trie in console.
(, (a, (an, (ano, (anot, (anoth, (anothe, (another))))))),
(b, (bo, (bob), (boo, (bool)))), (h, (ho, (hom, (home)))))
Recursive algorithm of Trie insertion
The iterative algorithms can transform to recursive algorithm by such approach.
We take one character from the key, and locate the child branch, then recursively
insert the left characters of the key to that branch. If the branch is empty, we
create a new node and add it to children before doing the recursively insertion.
1: function TRIE-INSERT(T, key, data)
2: if T = NIL then
3: T EmptyNode
140CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
4: if key = NIL then
5: V ALUE(T) data
6: else
7: p FIND(CHILDREN(T), FIRST(key))
8: if p = NIL then
9: p APPEND(CHILDREN(T), FIRST(key), EmptyNode)
10: TRIE INSERT
(p, REST(key))
Look up in alphabetic trie in Haskell
To express this algorithm in Haskell, we can utilize lookup function in Haskell
standard library[4].
find :: Trie a String Maybe a
find t [] = value t
find t (k:ks) = case lookup k (children t) of
Nothing Nothing
Just t find t ks
We can append some search test cases right after insert.
5.6. ALPHABETIC PATRICIA TREE 145
testTrie = "t=" ++ (toString t) ++
"nsearch t an: " ++ (show (find t "an")) ++
"nsearch t boy: " ++ (show (find t "boy")) ++
"nsearch t the: " ++ (show (find t "the"))
...
Here is the search result.
search t an: Just 2
search t boy: Just 3
search t the: Nothing
Look up in alphabetic trie in Scheme/Lisp
In Scheme/Lisp program, if the key is empty, we just return the value of the
current node, else we recursively nd in children of the node to see if there is
a child binding to a character, which match the rst character of the key. We
repeat this process till examine all characters of the key.
(define (lookup t k)
(define (find k lst)
(if (null? lst) ()
(if (string=? k (key (car lst)))
(tree (car lst))
(find k (cdr lst)))))
(if (string-null? k) (value t)
(let ((child (find (string-car k) (children t))))
(if (null? child) ()
(lookup child (string-cdr k))))))
we can test this look up with similar test cases as in Haskell program.
(define (test-trie)
(define t (listtrie (list ("a" 1) ("an" 2) ("another" 7)
("boy" 3) ("bool" 4) ("zoo" 3))))
(display (triestring t)) (newline)
(display "lookupan:") (display (lookup t "an")) (newline)
(display "lookupboy:") (display (lookup t "boy")) (newline)
(display "lookupthe:") (display (lookup t "the")) (newline))
This program will output the following result.
(test-trie)
(., (a1, (an2, (ano., (anot., (anoth., (anothe., (another7))))))),
(b., (bo., (boo., (bool4)), (boy3))), (z., (zo., (zoo3))))
lookup an: 2
lookup boy: 3
lookup the: ()
5.6 Alphabetic Patricia Tree
Alphabetic Trie has the same problem as integer Trie. It is not memory ecient.
We can use the same method to compress alphabetic Trie into Patricia.
146CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
5.6.1 Denition of alphabetic Patricia Tree
Alphabetic patricia tree is a special tree, each node contains multiple branches.
All children of a node share the longest common prex string. There is no node
has only one children, because it is conict with the longest common prex
property.
If we turn the Trie shown in gure 5.9 into Patricia tree by compressing
those nodes which has only one child. we can get a Patricia tree like in gure
5.10.
a
a bo
zoo
zoo
an
n
another
other
bool
ol
boy
y
Figure 5.10: A Patricia tree, with key a, an, another, bool, boy and zoo inserted.
Note that the root node always contains empty value.
Denition of alphabetic Patricia Tree in Haskell
We can use a similar denition as Trie in Haskell, we need change the type of
the rst element of children from single character to string.
type Key = String
data Patricia a = Patricia { value :: Maybe a
, children :: [(Key, Patricia a)]}
empty = Patricia Nothing []
leaf :: a Patricia a
leaf x = Patricia (Just x) []
Besides the denition, helper functions to create a empty Patricia node and
to create a leaf node are provided.
Denition of alphabetic Patricia tree in Python
The denition of Patricia tree is same as Trie in Python.
class Patricia:
def __init__(self, value = None):
self.value = value
5.6. ALPHABETIC PATRICIA TREE 147
self.children = {}
Denition of alphabetic Patricia tree in C++
With ISO C++, we abstract the key type of value type as type parameters, and
utilize STL provide map container to represent children of a node.
template<class Key, class Value>
struct Patricia{
typedef Patricia<Key, Value> Self;
typedef std::map<Key, Self> Children;
typedef Key KeyType;
typedef Value ValueType;
Patricia(Value v=Value()):value(v){}
virtual ~Patricia(){
for(typename Children::iterator it=children.begin();
it!=children.end(); ++it)
delete itsecond;
}
Value value;
Children children;
};
For illustration purpose, we simply release the memory in a recursive way.
Denition of alphabetic Patricia tree in Scheme/Lisp
We can fully reuse the denition of alphabetic Trie in Scheme/Lisp. In order to
provide a easy way to create a leaf node, we dene an extra helper function.
(define (make-leaf x)
(make-trie x ()))
5.6.2 Insertion of alphabetic Patricia Tree
When insert a key, s, into the Patricia tree, if the tree is empty, we can just
create an leaf node. Otherwise, we need check each child of the Patricia tree.
Every branch of the children is binding to a key, we denote them as, s
1
, s
2
, ..., s
n
,
which means there are n branches. if s and s
i
have common prex, we then
need branch out 2 new sub branches. Branch itself is represent with the common
prex, each new branches is represent with the dierent parts. Note there are
2 special cases. One is that s is the substring of s
i
, the other is that s
i
is the
substring of s. Figure 5.11 shows these dierent cases.
Iterative insertion algorithm for alphabetic Patricia
The insertion algorithm can be described as below pseudo code.
1: function PATRICIA-INSERT(T, key, value)
2: if T = NIL then
3: T NewNode
148CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
NIL
boy
(a)
bo
ol y
(b)
x
another
p1 p2 ...
y
an
x
other
p1 p2 ...
(c)
an p1 ...
another
insert
an p1 ...
other
insert
(d)
Figure 5.11: (a). Insert key, boy into an empty Patricia tree, the result is a
leaf node;
(b). Insert key, bool into (a), result is a branch with common prex bo.
(c). Insert an, with value y into node x with prex another.
(d). Insert another, into a node with prex an, the key to be inserted
update to other, and do further insertion.
5.6. ALPHABETIC PATRICIA TREE 149
4: p = T
5: loop
6: match FALSE
7: for each i in CHILDREN(p) do
8: if key = KEY (i) then
9: V ALUE(p) value
10: return T
11: prefix LONGEST COMMONPREFIX(key, KEY (i))
12: key1 key subtract prefix
13: key2 KEY (i) subtract prefix
14: if prefix = NIL then
15: match TRUE
16: if key2 = NIL then
17: p TREE(i)
18: key key substract prefix
19: break
20: else
21: CHILDREN(p)[prefix] BRANCH(key1, value, key2, TREE(i))
22: DELETE CHILDREN(p)[KEY (i)]
23: return T
24: if match = FALSE then
25: CHILDREN(p)[key] CREATE LEAF(value)
26: return T
In the above algorithm, LONGEST-COMMON-PREFIX function will nd
the longest common prex of two given string, for example, string bool and
boy has longest common prex bo. BRANCH function will create a branch
node and update keys accordingly.
Insertion of alphabetic Patricia in C++
in C++, to support implicit type conversion we utilize the KeyType and Val-
ueType as parameter types. If we dene Patriciastd::string, std::string, we
can directly provide char* parameters. the algorithm is implemented as the
following.
template<class K, class V>
Patricia<K, V> insert(Patricia<K, V> t,
typename Patricia<K, V>::KeyType key,
typename Patricia<K, V>::ValueType value=V()){
if(!t)
t = new Patricia<K, V>();
Patricia<K, V> p = t;
typedef typename Patricia<K, V>::Children::iterator Iterator;
for(;;){
bool match(false);
for(Iterator it = pchildren.begin(); it!=pchildren.end(); ++it){
K k=itfirst;
if(key == k){
pvalue = value; //overwrite
return t;
}
150CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
K prefix = lcp(key, k);
if(!prefix.empty()){
match=true;
if(k.empty()){ //e.g. insert "another" into "an"
p = itsecond;
break;
}
else{
pchildren[prefix] = branch(key, new Patricia<K, V>(value),
k, itsecond);
pchildren.erase(it);
return t;
}
}
}
if(!match){
pchildren[key] = new Patricia<K, V>(value);
break;
}
}
return t;
}
Where the lcp and branch functions are dened like this.
template<class K>
K lcp(K& s1, K& s2){
typename K::iterator it1(s1.begin()), it2(s2.begin());
for(; it1!=s1.end() && it2!=s2.end() && it1 == it2; ++it1, ++it2);
K res(s1.begin(), it1);
s1 = K(it1, s1.end());
s2 = K(it2, s2.end());
return res;
}
template<class T>
T branch(typename T::KeyType k1, T t1,
typename T::KeyType k2, T t2){
if(k1.empty()){ //e.g. insert "an" into "another"
t1children[k2] = t2;
return t1;
}
T t = new T();
tchildren[k1] = t1;
tchildren[k2] = t2;
return t;
}
Function lcp() will extract the longest common prex and modify its pa-
rameters. Function branch() will create a new node and set the 2 nodes to
be merged as the children. There is a special case, if the key of one node is
sub-string of the other, it will chain them together.
We nd the implementation of patricia to str() will be very same as trie to str(),
so we can reuse it. Also the convert from a list of keys to trie can be reused
// list_to_trie
5.6. ALPHABETIC PATRICIA TREE 151
template<class Iterator, class T>
T list_to_trie(Iterator first, Iterator last, T t){
typedef typename T::ValueType ValueType;
return std::accumulate(first, last, t,
std::ptr_fun(insert_key<T, ValueType>));
}
We put all of the helper function templates to a utility header le, and we
can test patricia insertion program as below.
template<class Iterator>
void test_list_to_patricia(Iterator first, Iterator last){
typedef Patricia<std::string, std::string> PatriciaType;
PatriciaType t(0);
t = list_to_trie(first, last, t);
std::copy(first, last,
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"n="<<trie_to_str(t)<<"n";
delete t;
}
void test_insert(){
const char lst1[] = {"a", "an", "another", "b", "bob", "bool", "home"};
test_list_to_patricia(lst1, lst1+sizeof(lst1)/sizeof(char));
const char lst2[] = {"home", "bool", "bob", "b", "another", "an", "a"};
test_list_to_patricia(lst2, lst2+sizeof(lst2)/sizeof(char));
const char lst3[] = {"romane", "romanus", "romulus"};
test_list_to_patricia(lst3, lst3+sizeof(lst3)/sizeof(char));
typedef Patricia<std::string, std::string> PatriciaType;
PatriciaType t(0);
const char keys[] = {"001", "100", "101"};
const char vals[] = {"y", "x", "z"};
for(unsigned int i=0; i<sizeof(keys)/sizeof(char); ++i)
t = insert(t, std::string(keys[i]), std::string(vals[i]));
std::copy(keys, keys+sizeof(keys)/sizeof(char),
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"="<<trie_to_str(t)<<"n";
delete t;
}
Running test insert() function will generate the following output.
a, an, another, b, bob, bool, home,
==>(, (a, (an, (another))), (b, (bo, (bob), (bool))), (home))
home, bool, bob, b, another, an, a,
==>(, (a, (an, (another))), (b, (bo, (bob), (bool))), (home))
romane, romanus, romulus,
==>(, (rom, (roman, (romane), (romanus)), (romulus)))
001, 100, 101, ==>(, (001:y), (10, (100:x), (101:z)))
152CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Insertion of alphabetic Patrica Tree in Python
By translate the insertion algorithm into Python language, we can get a program
as below.
def insert(t, key, value = None):
if t is None:
t = Patricia()
node = t
while(True):
match = False
for k, tr in node.children.items():
if key == k: # just overwrite
node.value = value
return t
(prefix, k1, k2)=lcp(key, k)
if prefix != "":
match = True
if k2 == "":
# example: insert "another" into "an", go on traversing
node = tr
key = k1
break
else: #branch out a new leaf
node.children[prefix] = branch(k1, Patricia(value), k2, tr)
del node.children[k]
return t
if not match: # add a new leaf
node.children[key] = Patricia(value)
break
return t
Where the longest common prex nding and branching functions are im-
plemented as the following.
# longest common prefix
# returns (p, s1, s2), where p is lcp, s1=s1-p, s2=s2-p
def lcp(s1, s2):
j=0
while jlen(s1) and jlen(s2) and s1[0:j]==s2[0:j]:
j+=1
j-=1
return (s1[0:j], s1[j:], s2[j:])
def branch(key1, tree1, key2, tree2):
if key1 == "":
#example: insert "an" into "another"
tree1.children[key2] = tree2
return tree1
t = Patricia()
t.children[key1] = tree1
t.children[key2] = tree2
return t
5.6. ALPHABETIC PATRICIA TREE 153
Function lcp check every characters of two strings are same one by one till
it met a dierent one or either of the string nished.
In order to test the insertion program, some helper functions are provided.
def to_string(t):
return trie_to_str(t)
def list_to_patricia(l):
return from_list(l, insert)
def map_to_patricia(m):
return from_map(m, insert)
We can reuse the trie to str since their implementation are same. to string
function can turn a Patricia tree into string by traversing it in pre-order. list to patricia
helps to convert a list of object into a Patricia tree by repeatedly insert every
elements into the tree. While map to string does similar thing except it can
convert a list of key-value pairs into a Patricia tree.
Then we can test the insertion program with below test cases.
class PatriciaTest:
#...
def test_insert(self):
print "testinsert"
t = list_to_patricia(["a", "an", "another", "b", "bob", "bool", "home"])
print to_string(t)
t = list_to_patricia(["romane", "romanus", "romulus"])
print to_string(t)
t = map_to_patricia({"001":y, "100":x, "101":z})
print to_string(t)
t = list_to_patricia(["home", "bool", "bob", "b", "another", "an", "a"]);
print to_string(t)
These test cases will output a series of result like this.
(, (a, (an, (another))), (b, (bo, (bob), (bool))), (home))
(, (rom, (roman, (romane), (romanus)), (romulus)))
(, (001:y), (10, (100:x), (101:z)))
(, (a, (an, (another))), (b, (bo, (bob), (bool))), (home))
Recursive insertion algorithm for alphabetic Patricia
The insertion can also be implemented recursively. When doing insertion, the
program check all the children of the Patricia node, to see if there is a node can
match the key. Match means they have common prex. One special case is that
the keys are same, the program just overwrite the value of that child. If there
is no child can match the key, the program create a new leaf, and add it as a
new child.
1: function PATRICIA-INSERT(T, key, value)
2: if T = NIL then
3: T EmptyNode
4: p FIND MATCH(CHILDREN(T), key)
5: if p = NIL then
6: ADD(CHILDREN(T), CREATE LEAF(key, value))
154CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
7: else if KEY (p) = key then
8: V ALUE(p) value
9: else
10: q BRANCH(CREATE LEAF(key, value), p)
11: ADD(CHILDREN(T), q)
12: DELETE(CHILDREN(T), p)
13: return T
The recursion happens inside call to BRANCH. The longest common prex
of 2 nodes are extracted. If the key to be inserted is the sub-string of the node,
we just chain them together; If the prex of the node is the sub-string of the
key, we recursively insert the rest of the key to the node. In other case, we
create a new node with the common prex and set its two children.
1: function BRANCH(T1, T2)
2: prefix LONGEST COMMON PREFIX(T1, T2)
3: p EmptyNode
4: if prefix = KEY (T1) then
5: KEY (T2) KEY (T2) subtract prefix
6: p CREATE LEAF(prefix, V ALUE(T1))
7: ADD(CHILDREN(p), T2)
8: else if prefix = KEY (T2) then
9: KEY (T1) KEY (T1) subtract prefix
10: p PATRICIAINSERT
(FIRST(l), key)
9: else
10: return FIND IN CHILDREN(REST(l), key)
Look up in alphabetic Patrica Tree in Haskell
In Haskell implementation, the above algorithm should be turned into recursive
way.
-- lookup
import qualified Data.List
find :: Patricia a Key Maybe a
find t k = find (children t) k where
find [] _ = Nothing
find (p:ps) k
| (fst p) == k = value (snd p)
| (fst p) Data.List.isPrefixOf k = find (snd p) (diff (fst p) k)
| otherwise = find ps k
diff k1 k2 = drop (length (lcp k1 k2)) k2
When we search a given key in a Patricia tree, we recursively check each of
the child. If there are no children at all, we stop the recursion and indicate a
look up failure. In other case, we pick the prex-node pair one by one. If the
prex is as same as the given key, it means the target node is found and the
value of the node is returned. If the key has common prex with the child, the
key will be updated by removing the longest common prex and we performs
looking up recursively.
5.6. ALPHABETIC PATRICIA TREE 161
We can verify the above Haskell program with the following simple cases.
testPatricia = "t1=" ++ (toString t1) ++ "n" ++
"find t1 another =" ++ (show (find t1 "another")) ++ "n" ++
"find t1 bo = " ++ (show (find t1 "bo")) ++ "n" ++
"find t1 boy = " ++ (show (find t1 "boy")) ++ "n" ++
"find t1 boolean = " ++ (show (find t1 "boolean"))
where
t1 = fromList [("a", 1), ("an", 2), ("another", 7), ("boy", 3),
("bool", 4), ("zoo", 3)]
main = do
putStrLn testPatricia
The output is as below.
t1=(, (a:1, (an:2, (another:7))), (bo, (bool:4), (boy:3)), (zoo:3))
find t1 another =Just 7
find t1 bo = Nothing
find t1 boy = Just 3
find t1 boolean = Nothing
Look up in alphabetic Patricia Tree in Scheme/Lisp
The Scheme/Lisp program is given as the following. The function delegate the
looking up to an inner function which will check each child to see if the key
binding to the child match the string we are looking for.
(define (lookup t k)
(define (find lst k) ;; lst, [(k patricia)]
(if (null? lst) ()
(cond ((string=? (key (car lst)) k) (value (tree (car lst))))
((string-prefix? (key (car lst)) k)
(lookup (tree (car lst))
(string-tail k (string-length (key (car lst))))))
(else (find (cdr lst) k)))))
(find (children t) k))
In order to verify this program, some simple test cases are given to search
in the Patricia we created in previous section.
(define (test-patricia)
(define t (listtrie (list ("a" 1) ("an" 2) ("another" 7)
("boy" 3) ("bool" 4) ("zoo" 3))))
(display (triestring t)) (newline)
(display "lookupanother:") (display (lookup t "another")) (newline)
(display "lookupbo:") (display (lookup t "bo")) (newline)
(display "lookupboy:") (display (lookup t "boy")) (newline)
(display "lookupby:") (display (lookup t "by")) (newline)
(display "lookupboolean:") (display (lookup t "boolean")) (newline))
This program will output the same result as the Haskell one.
(test-patricia)
(., (a1, (an2, (another7))), (bo., (bool4), (boy3)), (zoo3))
lookup another: 7
lookup bo: ()
162CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
lookup boy: 3
lookup by: ()
lookup boolean: ()
5.7 Trie and Patricia used in Industry
Trie and Patricia are widely used in software industry. Integer based Patricia
tree is widely used in compiler. Some daily used software has very interest-
ing features can be realized with Trie and Patricia. In the following sections,
Ill list some of them, including, e-dictionary, word auto-completion, t9 input
method etc. The commercial implementation typically doesnt adopt Trie or
Patricia directly. However, Trie and Patricia can be shown as a kind of example
realization.
5.7.1 e-dictionary and word auto-completion
Figure 5.12 shows a screen shot of an English-Chinese dictionary. In order to
provide good user experience, when user input something, the dictionary will
search its word library, and list all candidate words and phrases similar to what
user have entered.
Figure 5.12: e-dictionary. All candidates starting with what user input are
listed.
Typically such dictionary contains hundreds of thousands words, performs a
whole word search is expensive. Commercial software adopts complex approach,
including caching, indexing etc to speed up this process.
Similar with e-dictionary, gure 5.13 shows a popular Internet search engine,
when user input something, it will provide a candidate lists, with all items start
with what user has entered. And these candidates are shown in an order of
5.7. TRIE AND PATRICIA USED IN INDUSTRY 163
popularity. The more people search for a word, the upper position it will be
shown in the list.
Figure 5.13: Search engine. All candidates key words starting with what user
input are listed.
In both case, we say the software provide a kind of word auto-completion
support. In some modern IDE, the editor can even helps user to auto-complete
programmings.
In this section, Ill show a very simple implementation of e-dictionary with
Trie and Patricia. To simplify the problem, let us assume the dictionary only
support English - English information.
Typically, a dictionary contains a lot of key-value pairs, the keys are English
words or phrases, and the relative values are the meaning of the words.
We can store all words and their meanings to a Trie, the drawback for this
approach is that it isnt space eective. Well use Patricia as alternative later
on.
As an example, when user want to look up a, the dictionary does not only
return the meaning of then English word a, but also provide a list of candidate
words, which are all start with a, including abandon, about, accent, adam,
... Of course all these words are stored in Trie.
If there are too many candidates, one solution is only display the top 10
words for the user, and if he like, he can browse more.
Below pseudo code reuse the looking up program in previous sections and
Expand all potential top N candidates.
1: function TRIE-LOOK-UP-TOP-N(T, key, N)
2: p TRIE LOOK UP
(T, key)
3: return EXPAND TOP N(p, key, N)
Note that we should modify the TRIE-LOOK-UP a bit, instead of return
the value of the node, TRIE-LOOK-UP returns the node itself.
Another alternative is to use Patricia instead of Trie. It can save much
164CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
spaces.
Iterative algorithm of search top N candidate in Patricia
The algorithm is similar to the Patricia look up one, but when we found a node
which key start from the string we are looking for, we expand all its children
until we get N candidates.
1: function PATRICIA-LOOK-UP-TOP-N(T, key, N)
2: if T = NIL then
3: return NIL
4: prefix NIL
5: repeat
6: match FALSE
7: for each i in CHILDREN(T) do
8: if key is prex of KEY (i) then
9: return EXPAND TOP N(TREE(i), prefix, N)
10: if KEY (i) is prex of key then
11: match TRUE
12: key key subtract KEY (i)
13: T TREE(i)
14: prefix prefix +KEY (i)
15: break
16: until match = FALSE
17: return NIL
An e-dictionary in Python
In Python implementation, a function trie lookup is provided to perform search
all top N candidate started with a given string.
def trie_lookup(t, key, n):
if t is None:
return None
p = t
for c in key:
if not c in p.children:
return None
p=p.children[c]
return expand(key, p, n)
def expand(prefix, t, n):
res = []
q = [(prefix, t)]
while len(res)<n and len(q)>0:
(s, p) = q.pop(0)
if p.value is not None:
res.append((s, p.value))
for k, tr in p.children.items():
q.append((s+k, tr))
return res
5.7. TRIE AND PATRICIA USED IN INDUSTRY 165
Compare with the Trie look up function, the rst part of this program is
almost same. The dierence part is after we successfully located the node which
matches the key, all sub trees are expanded from this node in a bread-rst search
manner, and the top n candidates are returned.
This program can be veried by below simple test cases.
class LookupTest:
def __init__(self):
dict = {"a":"thefirstletterofEnglish",
"an":"...samedictasinHaskellexample"}
self.tt = trie.map_to_trie(dict)
def run(self):
self.test_trie_lookup()
def test_trie_lookup(self):
print "testlookuptop5"
print "searcha", trie_lookup(self.tt, "a", 5)
print "searchab", trie_lookup(self.tt, "ab", 5)
The test will output the following result.
test lookup to 5
search a [(a, the first letter of English), (an, "used instead of a
when the following word begins with a vowel sound"), (adam, a character in
the Bible who was the first man made by God), (about, on the subject of;
connected with), (abandon, to leave a place, thing or person forever)]
search ab [(about, on the subject of; connected with), (abandon, to
leave a place, thing or person forever)]
To save the spaces, we can also implement such a dictionary search by using
Patricia.
def patricia_lookup(t, key, n):
if t is None:
return None
prefix = ""
while(True):
match = False
for k, tr in t.children.items():
if string.find(k, key) == 0: #is prefix of
return expand(prefix+k, tr, n)
if string.find(key, k) ==0:
match = True
key = key[len(k):]
t = tr
prefix += k
break
if not match:
return None
In this program, we called Python string class to test if a string x is prex
of a string y. In case we locate a node with the key we are looking up is either
equal of as prex of the this sub tree, we expand it till we nd n candidates.
Function expand() can be reused here.
166CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
We can test this program with the very same test cases and the results are
identical to the previous one.
An e-dictionary in C++
In C++ implementation, we overload the look up function by providing an extra
integer n to indicate we want to search top n candidates. the result is a list of
key-value pairs,
//lookup top n candidate with prefix key in Trie
template<class K, class V>
std::list<std::pair<K, V> > lookup(Trie<K, V> t,
typename Trie<K, V>::KeyType key,
unsigned int n)
{
typedef std::list<std::pair<K, V> > Result;
if(!t)
return Result();
Trie<K, V> p(t);
for(typename K::iterator it=key.begin(); it!=key.end(); ++it){
if(pchildren.find(it) == pchildren.end())
return Result();
p = pchildren[it];
}
return expand(key, p, n);
}
The program is almost same as the Trie looking up one, except it will call
expand function when it located the node with the key. Function expand is as
the following.
template<class T>
std::list<std::pair<typename T::KeyType, typename T::ValueType> >
expand(typename T::KeyType prefix, T t, unsigned int n)
{
typedef typename T::KeyType KeyType;
typedef typename T::ValueType ValueType;
typedef std::list<std::pair<KeyType, ValueType> > Result;
Result res;
std::queue<std::pair<KeyType, T> > q;
q.push(std::make_pair(prefix, t));
while(res.size()<n && (!q.empty())){
std::pair<KeyType, T> i = q.front();
KeyType s = i.first;
T p = i.second;
q.pop();
if(pvalue != ValueType()){
res.push_back(std::make_pair(s, pvalue));
}
for(typename T::Children::iterator it = pchildren.begin();
it!=pchildren.end(); ++it)
q.push(std::make_pair(s+itfirst, itsecond));
}
5.7. TRIE AND PATRICIA USED IN INDUSTRY 167
return res;
}
This function use a bread-rst search approach to expand top N candidates,
it maintain a queue to store the node it is currently dealing with. Each time the
program picks a candidate node from the queue, expands all its children and
put them to the queue. the program will terminate when the queue is empty or
we have already found N candidates.
Function expand is generic well use it in later sections.
Then we can provide a helper function to convert the candidate list to read-
able string. Note that this list is actually a list of pairs so we can provide a
generic function.
//list of pairs to string
template<class Container>
std::string lop_to_str(Container coll){
typedef typename Container::iterator Iterator;
std::ostringstream s;
s<<"[";
for(Iterator it=coll.begin(); it!=coll.end(); ++it)
s<<"("<<itfirst<<","<<itsecond<<"),";
s<<"]";
return s.str();
}
After that, we can test the program with some simple test cases.
Trie<std::string, std::string> t(0);
const char dict[] = {
"a", "thefirstletterofEnglish",
"an", "usedinsteadofawhenthefollowingwordbeginswithavowelsound",
"another", "onemorepersonorthingoranextraamount",
"abandon", "toleaveaplace,thingorpersonforever",
"about", "onthesubjectof;connectedwith",
"adam", "acharacterintheBiblewhowasthefirstmanmadebyGod",
"boy", "amalechildor,moregenerally,amaleofanyage",
"body", "thewholephysicalstructurethatformsapersonoranimal",
"zoo", "anareainwhichanimals,especiallywildanimals,arekept"
"sothatpeoplecangoandlookatthem,orstudythem"};
const char first=dict;
const char last =dict + sizeof(dict)/sizeof(char);
for(;first!=last; ++first, ++first)
t = insert(t, first, (first+1));
}
std::cout<<"testlookuptop5inTrien"
<<"searcha"<<lop_to_str(lookup(t, "a", 5))<<"n"
<<"searchab"<<lop_to_str(lookup(t, "ab", 5))<<"n";
delete t;
The result print to the console is something like this:
test lookup top 5 in Trie
search a [(a, the first letter of English), (an, used instead of a
when the following word begins with a vowel sound), (adam, a character
168CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
in the Bible who was the first man made by God), (about, on the
subject of; connected with), (abandon, to leave a place, thing or
person forever), ]
search ab [(about, on the subject of; connected with), (abandon, to
leave a place, thing or person forever), ]
To save the the space with Patricia, we provide a C++ program to search
top N candidate as below.
template<class K, class V>
std::list<std::pair<K, V> > lookup(Patricia<K, V> t,
typename Patricia<K, V>::KeyType key,
unsigned int n)
{
typedef typename std::list<std::pair<K, V> > Result;
typedef typename Patricia<K, V>::Children::iterator Iterator;
if(!t)
return Result();
K prefix;
for(;;){
bool match(false);
for(Iterator it=tchildren.begin(); it!=tchildren.end(); ++it){
K k(itfirst);
if(is_prefix_of(key, k))
return expand(prefix+k, itsecond, n);
if(is_prefix_of(k, key)){
match = true;
prefix += k;
lcp<K>(key, k); //update key
t = itsecond;
break;
}
}
if(!match)
return Result();
}
}
The program iterate all children if the string we are looked up is prex of
one child, we expand this child to nd top N candidates; If the in the opposite
case, we update the string and go on examine into this child Patricia tree.
Where the function is prex of() is dened as below.
// x is prefix of y?
template<class T>
bool is_prefix_of(T x, T y){
if(x.size()y.size())
return std::equal(x.begin(), x.end(), y.begin());
return false;
}
end{lstlisitng}
We use STL equal function to check if x is prefix of y.
The test case is nearly same as the one in Trie.
5.7. TRIE AND PATRICIA USED IN INDUSTRY 169
begin{lstlisting}
Patricia<std::string, std::string> t(0);
const char dict[] = {
"a", "thefirstletterofEnglish",
"an", "usedinsteadofawhenthefollowingwordbeginswithavowelsound",
"another", "onemorepersonorthingoranextraamount",
"abandon", "toleaveaplace,thingorpersonforever",
"about", "onthesubjectof;connectedwith",
"adam", "acharacterintheBiblewhowasthefirstmanmadebyGod",
"boy", "amalechildor,moregenerally,amaleofanyage",
"body", "thewholephysicalstructurethatformsapersonoranimal",
"zoo", "anareainwhichanimals,especiallywildanimals,arekept"
"sothatpeoplecangoandlookatthem,orstudythem"};
const char first=dict;
const char last =dict + sizeof(dict)/sizeof(char);
for(;first!=last; ++first, ++first)
t = insert(t, first, (first+1));
}
std::cout<<"testlookuptop5inTrien"
<<"searcha"<<lop_to_str(lookup(t, "a", 5))<<"n"
<<"searchab"<<lop_to_str(lookup(t, "ab", 5))<<"n";
delete t;
This test case will output a very same result in console.
Recursive algorithm of search top N candidate in Patricia
This algorithm can also be implemented recursively, if the string we are looking
for is empty, we expand all children until we get N candidates. else we recursively
examine the children of the node to see if we can nd one has prex as this string.
1: function PATRICIA-LOOK-UP-TOP-N(T, key, N)
2: if T = NIL then
3: return NIL
4: if KEY = NIL then
5: return EXPAND TOP N(T, NIL, N)
6: else
7: return FINDINCHILDRENTOPN(CHILDREN(T), key, N)
8: function FIND-IN-CHILDREN-TOP-N(l, key, N)
9: if l = NIL then
10: return NIL
11: else if KEY (FIRST(l)) = key then
12: return EXPAND TOP N(FIRST(l), key, N)
13: else if KEY (FIRST(l) is prex of key then
14: return PATRICIA LOOK UP TOP N(FIRST(l), key
subtract KEY (FIRST(l)))
15: else if key is prex of KEY (FIRST(l) then
16: return PATRICIALOOKUP TOP N(FIRST(l), NIL, N)
17: else if
18: thenreturn FINDINCHILDRENTOPN(REST(l), key, N)
170CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
An e-dictionary in Haskell
In Haskell implementation, we provide a function named as ndAll. Thanks for
the lazy evaluation support, ndAll wont produce all candidates words until we
need them. we can use something like take 10 ndAll to get the top 10 words
easily.
ndAll is given as the following.
findAll:: Trie a String [(String, a)]
findAll t [] =
case value t of
Nothing enum (children t)
Just x ("", x):(enum (children t))
where
enum [] = []
enum (p:ps) = (mapAppend (fst p) (findAll (snd p) [])) ++ (enum ps)
findAll t (k:ks) =
case lookup k (children t) of
Nothing []
Just t mapAppend k (findAll t ks)
mapAppend x lst = map (p(x:(fst p), (snd p))) lst
function ndAll take a Trie, a word to be looked up, it will output a list of
pairs, the rst element of the pair is the candidate word, the second element of
the pair is the meaning of the word.
Compare with the nd function of Trie, the none-trivial case is very similar.
We take a letter form the words to be looked up, if there is no child starting
with this letter, the program returns empty list. If there is such a child starting
with this letter, this child should be a candidate. We use function mapAppend
to add this letter in front of all elements of recursively founded candidate words.
In case we consumed all letters, we next returns all potential words, which
means the program will traverse all children of the current node.
Note that only the node with value eld not equal to None is a meaningful
word in our dictionary. We need append the list with the right meaning.
With this function, we can construct a very simple dictionary and return
top 5 candidate to user. Here is the test program.
testFindAll = "nlook up a: " ++ (show $ take 5 $findAll t "a") ++
"nlook up ab: " ++ (show $ take 5 $findAll t "ab")
where
t = fromList [
("a", "the first letter of English"),
("an", "used instead of a when the following word begins with"
"a vowel sound"),
("another", "one more person or thing or an extra amount"),
("abandon", "to leave a place, thing or person forever"),
("about", "on the subject of; connected with"),
("adam", "a character in the Bible who was the first man made by God"),
("boy", "a male child or, more generally, a male of any age"),
("body", "the whole physical structure that forms a person or animal"),
("zoo", "an area in which animals, especially wild animals, are kept"
" so that people can go and look at them, or study them")]
5.7. TRIE AND PATRICIA USED IN INDUSTRY 171
main = do
putStrLn testFindAll
This program will out put a result like this:
look up a: [("a","the first letter of English"),("an","used instead of a
when the following word begins with a vowel sound"),("another","one more
person or thing or an extra amount"),("abandon","to leave a place, thing
or person forever"),("about","on the subject of; connected with")]
look up ab: [("abandon","to leave a place, thing or person forever"),
("about","on the subject of; connected with")]
The Trie solution wasts a lot of spaces. It is very easy to improve the above
program with Patricia. Below source code shows the Patricia approach.
findAll :: Patricia a Key [(Key, a)]
findAll t [] =
case value t of
Nothing enum $ children t
Just x ("", x):(enum $ children t)
where
enum [] = []
enum (p:ps) = (mapAppend (fst p) (findAll (snd p) [])) ++ (enum ps)
findAll t k = find (children t) k where
find [] _ = []
find (p:ps) k
| (fst p) == k
= mapAppend k (findAll (snd p) [])
| (fst p) Data.List.isPrefixOf k
= mapAppend (fst p) (findAll (snd p) (k diff (fst p)))
| k Data.List.isPrefixOf (fst p)
= findAll (snd p) []
| otherwise = find ps k
diff x y = drop (length y) x
mapAppend s lst = map (p(s++(fst p), snd p)) lst
If compare this program with the one implemented by Trie, we can nd they
are very similar to each other. In none-trivial case, we just examine each child
to see if any one match the key to be looked up. If one child is exactly equal
to the key, we then expand all its sub branches and put them to the candidate
list. If the child correspond to a prex of the key, the program goes on nd the
the rest part of the key along this child and concatenate this prex to all later
results. If the current key is prex to a child, the program will traverse this
child and return all its sub branches as candidate list.
This program can be tested with the very same case as above, and it will
output the same result.
An e-dictionary in Scheme/Lisp
In Scheme/Lisp implementation with Trie, a function named nd is used to
search all candidates start with a given string. If the string is empty, the program
will enumerate all sub trees as result; else the program calls an inner function
nd-child to search a child which matches the rst character of the given string.
172CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Then the program recursively apply the nd function to this child with the rest
characters of the string to be searched.
(define (find t k)
(define (find-child lst k)
(tree (find-matching-item lst (lambda (c) (string=? (key c) k)))))
(if (string-null? k)
(enumerate t)
(let ((t-new (find-child (children t) (string-car k))))
(if (null? t-new) ()
(map-string-append (string-car k) (find t-new (string-cdr k)))))))
Note that the map-string-append will insert the rst character to all the
elements (more accurately, each element is a pair with a key and a value, map-
string-append insert the character in front of each key) in the result returned
by recursive call. It is dened like this.
(define (map-string-append x lst) ;; lst: [(key value)]
(map (lambda (p) (cons (string-append x (car p)) (cdr p))) lst))
The enumerate function which can expend all sub trees is implemented as
the following.
(define (enumerate t) ;; enumerate all sub trees
(if (null? t) ()
(let ((res (append-map
(lambda (p)(map-string-append (key p)(enumerate (tree p))))
(children t))))
(if (null? (value t)) res
(cons (cons "" (value t)) res)))))
The test case is a very simple list of word-meaning pairs.
(define dict
(list ("a" "thefirstletterofEnglish")
("an" "usedinsteadofawhenthefollowingwordbeginswithavowelsound")
("another" "onemorepersonorthingoranextraamount")
("abandon" "toleaveaplace,thingorpersonforever")
("about" "onthesubjectof;connectedwith")
("adam" "acharacterintheBiblewhowasthefirstmanmadebyGod")
("boy" "amalechildor,moregenerally,amaleofanyage")
("body" "thewholephysicalstructurethatformsapersonoranimal")
("zoo" "anareainwhichanimals,especiallywildanimals,
arekeptsothatpeoplecangoandlookatthem,orstudythem")))
After feed this dict to a Trie, if user tries to nd a* or ab* like below.
(define (test-trie-find-all)
(define t (listtrie dict))
(display "finda:") (display (find t "a")) (newline)
(display "findab:") (display (find t "ab")) (newline))
The result is a list with all candidates start with the given string.
(test-trie-find-all)
find a*: ((a . the first letter of English) (an . used instead of a
when the following word begins with a vowel sound) (another . one more
person or thing or an extra amount) (abandon . to leave a place, thing
5.7. TRIE AND PATRICIA USED IN INDUSTRY 173
or person forever) (about . on the subject of; connected with) (adam
. a character in the Bible who was the first man made by God))
find ab*: ((abandon . to leave a place, thing or person forever)
(about . on the subject of; connected with))
Trie approach isnt space eective. Patricia can be one alternative to improve
in terms of space.
We can fully reuse the function enumerate, map-string-append which are
dened for trie. the nd function for Patricia is implemented as the following.
(define (find t k)
(define (find-child lst k)
(if (null? lst) ()
(cond ((string=? (key (car lst)) k)
(map-string-append k (enumerate (tree (car lst)))))
((string-prefix? (key (car lst)) k)
(let ((k-new (string-tail k (string-length (key (car lst))))))
(map-string-append (key (car lst)) (find (tree (car lst)) k-new))))
((string-prefix? k (key (car lst))) (enumerate (tree (car lst))))
(else (find-child (cdr lst) k)))))
(if (string-null? k)
(enumerate t)
(find-child (children t) k)))
If the same test cases of search all candidates of a* and ab* are fed we
can get a very same result.
5.7.2 T9 input method
Most mobile phones around year 2000 has a key pad. To edit a short mes-
sage/email with such key-pad, users typically have quite dierent experience
from PC. Because a mobile-phone key pad, or so called ITU-T key pad has few
keys. Figure g:itut-keypad shows an example.
Figure 5.14: an ITU-T keypad for mobile phone.
174CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
There are typical 2 methods to input an English word/phrase with ITU-T
key pad. For instance, if user wants to enter a word home, He can press the
key in below sequence.
Press key 4 twice to enter the letter h;
Press key 6 three times to enter the letter o;
Press key 6 twice to enter the letter m;
Press key 3 twice to enter the letter e;
Another high ecient way is to simplify the key press sequence like the
following.
Press key 4, 6, 6, 3, word home appears on top of the candidate
list;
Press key * to change a candidate word, so word good appears;
Press key * again to change another candidate word, next word gone
appears;
...
Compare these 2 method, we can see method 2 is much easier for the end
user, and it is operation ecient. The only overhead is to store a candidate
words dictionary.
Method 2 is called as T9 input method, or predictive input method [6],
[7]. The abbreviation T9 stands for textonym. In this section, Ill show an
example implementation of T9 by using Trie and Patricia.
In order to provide candidate words to user, a dictionary must be prepared
in advance. Trie or Patricia can be used to store the Dictionary. In the real
commercial software, complex indexing dictionary is used. We show the very
simple Trie and Patricia only for illustration purpose.
Iterative algorithm of T9 looking up
Below pseudo code shows how to realize T9 with Trie.
1: function TRIE-LOOK-UP-T9(T, key)
2: PUSH BACK(Q, NIL, key, T)
3: r NIL
4: while Q is not empty do
5: p, k, t POP FRONT(Q)
6: i FIRST LETTER(k)
7: for each c in T9 MAPPING(i) do
8: if c is in CHILDREN(t) then
9: k
k subtract i
10: if k
is empty then
11: APPEND(r, p +c)
12: else
13: PUSH BACK(Q, p +c, k
, CHILDREN(t)[c])
14: return r
5.7. TRIE AND PATRICIA USED IN INDUSTRY 175
This is actually a bread-rst search program. It utilizes a queue to store
the current node and key string we are examining. The algorithm takes the
rst digit from the key, looks up it in T9 mapping to get all English letters
corresponding to this digit. For each letter, if it can be found in the children
of current node, the node along with the English string found so far are push
back to the queue. In case all digits are examined, a candidate is found. Well
append this candidate to the result list. The loop will terminate when the queue
is empty.
Since Trie is not space eective, minor modication of the above program
can work with Patricia, which can help to save extra spaces.
1: function PATRICIA-LOOK-UP-T9(T, key)
2: PUSH BACK(Q, NIL, key, T)
3: r NIL
4: while Q is not empty do
5: p, k, t POP FRONT(Q)
6: for each child in CHILDREN(t) do
7: k
IS-PREFIX-OF k then
9: if k
= k then
10: APPEND(r, p +KEY (child))
11: else
12: PUSH BACK(Q, p +KEY (child), k k
, child)
13: return r
T9 implementation in Python
In Python implementation, T9 looking up is realized in a typical bread-rst
search algorithm as the following.
T9MAP={2:"abc", 3:"def", 4:"ghi", 5:"jkl",
6:"mno", 7:"pqrs", 8:"tuv", 9:"wxyz"}
def trie_lookup_t9(t, key):
if t is None or key == "":
return None
q = [("", key, t)]
res = []
while len(q)>0:
(prefix, k, t) = q.pop(0)
i=k[0]
if not i in T9MAP:
return None #invalid input
for c in T9MAP[i]:
if c in t.children:
if k[1:]=="":
res.append((prefix+c, t.children[c].value))
else:
q.append((prefix+c, k[1:], t.children[c]))
return res
Function trie lookup t9 check if the parameters are valid rst. Then it push
the initial data into a queue. The program repeatedly pop the item from the
176CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
queue, including what node it will examine next, the number sequence string,
and the alphabetic string it has been searched.
For each popped item, the program takes the next digit from the number
sequence, and looks up in T9 map to nd the corresponding English letters.
With all these letters, if they can be found in the children of the current node,
well push this child along with the updated number sequence string and updated
alphabetic string into the queue. In case we process all numbers, we nd a
candidate result.
We can verify the above program with the following test cases.
class LookupTest:
def __init__(self):
t9dict = ["home", "good", "gone", "hood", "a", "another", "an"]
self.t9t = trie.list_to_trie(t9dict)
def test_trie_t9(self):
print "search4", trie_lookup_t9(self.t9t, "4")
print "search46", trie_lookup_t9(self.t9t, "46")
print "search4663", trie_lookup_t9(self.t9t, "4663")
print "search2", trie_lookup_t9(self.t9t, "2")
print "search22", trie_lookup_t9(self.t9t, "22")
If we run the test, it will output a very same result as the above Haskell
program.
search 4 [(g, None), (h, None)]
search 46 [(go, None), (ho, None)]
search 4663 [(gone, None), (good, None), (home, None), (hood, None)]
search 2 [(a, None)]
search 22 []
To save the spaces, Patricia can be used instead of Trie.
def patricia_lookup_t9(t, key):
if t is None or key == "":
return None
q = [("", key, t)]
res = []
while len(q)>0:
(prefix, key, t) = q.pop(0)
for k, tr in t.children.items():
digits = toT9(k)
if string.find(key, digits)==0: #is prefix of
if key == digits:
res.append((prefix+k, tr.value))
else:
q.append((prefix+k, key[len(k):], tr))
return res
Compare to the implementation with Trie, they are very similar. We also
used a bread-rst search approach. The dierent part is that we convert the
string of each child to number sequence string according to T9 mapping. if it is
prex of the key we are looking for, we push this child along with updated key
and prex. In case we examined all digits, we nd a candidate result.
The convert function is a reverse mapping process as below.
5.7. TRIE AND PATRICIA USED IN INDUSTRY 177
def toT9(s):
res=""
for c in s:
for k, v in T9MAP.items():
if string.find(v, c)0:
res+=k
break
#error handling skipped.
return res
For illustration purpose, the error handling for invalid letters is skipped.
If we feed the program with the same test cases, we can get a result as the
following.
search 4 []
search 46 [(go, None), (ho, None)]
search 466 []
search 4663 [(good, None), (gone, None), (home, None), (hood, None)]
search 2 [(a, None)]
search 22 []
The result is slightly dierent from the one output by Trie. The reason is
as same as what we analyzed in Haskell implementation. It is easily to modify
the program to output a similar result.
T9 implemented in C++
First we dene T9 mapping as a Singleton object, this is because we want to it
can be used both in Trie look up and Patricia look up programs.
struct t9map{
typedef std::map<char, std::string> Map;
Map map;
t9map(){
map[2]="abc";
map[3]="def";
map[4]="ghi";
map[5]="jkl";
map[6]="mno";
map[7]="pqrs";
map[8]="tuv";
map[9]="wxyz";
}
static t9map& inst(){
static t9map i;
return i;
}
};
Note in other languages or keypad layout, we can dene dierent mappings
and pass them as an argument to the looking up function.
178CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
With this mapping, the looking up in Trie can be given as below. Although
we want to keep the genericity of the program, for illustration purpose, we just
simply use the t9 mapping directly.
In order to keep the code as short as possible, a boost library tool, boost::tuple
is used. For more about boost::tuple, please refer to [8].
template<class K, class V>
std::list<std::pair<K, V> > lookup_t9(Trie<K, V> t,
typename Trie<K, V>::KeyType key)
{
typedef std::list<std::pair<K, V> > Result;
typedef typename Trie<K, V>::KeyType Key;
typedef typename Trie<K, V>::Char Char;
if((!t) | | key.empty())
return Result();
Key prefix;
std::map<Char, Key> m = t9map::inst().map;
std::queue<boost::tuple<Key, Key, Trie<K, V>> > q;
q.push(boost::make_tuple(prefix, key, t));
Result res;
while(!q.empty()){
boost::tie(prefix, key, t) = q.front();
q.pop();
Char c = key.begin();
key = Key(key.begin()+1, key.end());
if(m.find(c) == m.end())
return Result();
Key cs = m[c];
for(typename Key::iterator it=cs.begin(); it!=cs.end(); ++it)
if(tchildren.find(it)!=tchildren.end()){
if(key.empty())
res.push_back(std::make_pair(prefix+it, tchildren[it]value));
else
q.push(boost::make_tuple(prefix+it, key, tchildren[it]));
}
}
return res;
}
This program will rst check if the Patricia tree or the key are empty to deal
with with trivial case. It next initialize a queue, and push one tuple to it. the
tuple contains 3 elements, a prex to represent a string the program has been
searched, current key it need look up, and a node it will examine.
Then the program repeatedly pops the tuple from the queue, takes the rst
character from the key, and looks up in T9 map to get a candidate English
letter list. With each letter in this list, the program examine if it exists in the
children of current node. In case it nd such a child, if there is no left letter
to look up, it means we found a candidate result, we push it to the result list.
Else, we create a new tuple with updated prex, key and this child; the push it
to the queue for later process.
Below are some simple test cases for verication.
5.7. TRIE AND PATRICIA USED IN INDUSTRY 179
Trie<std::string, std::string> t9trie(0);
const char t9dict[] = {"home", "good", "gone", "hood", "a", "another", "an"};
t9trie = list_to_trie(t9dict, t9dict+sizeof(t9dict)/sizeof(char), t9trie);
std::cout<<"testt9lookupinTrien"
<<"search4"<<lop_to_str(lookup_t9(t9trie, "4"))<<"n"
<<"serach46"<<lop_to_str(lookup_t9(t9trie, "46"))<<"n"
<<"serach4663"<<lop_to_str(lookup_t9(t9trie, "4663"))<<"n"
<<"serach2"<<lop_to_str(lookup_t9(t9trie, "2"))<<"n"
<<"serach22"<<lop_to_str(lookup_t9(t9trie, "22"))<<"nn";
delete t9trie;
It will output the same result as the Python program.
test t9 lookup in Trie
search 4 [(g, ), (h, ), ]
serach 46 [(go, ), (ho, ), ]
serach 4663 [(gone, ), (good, ), (home, ), (hood, ), ]
serach 2 [(a, ), ]
serach 22 []
In order to save space, a looking up program for Patricia is also provided.
template<class K, class V>
std::list<std::pair<K, V> > lookup_t9(Patricia<K, V> t,
typename Patricia<K, V>::KeyType key)
{
typedef std::list<std::pair<K, V> > Result;
typedef typename Patricia<K, V>::KeyType Key;
typedef typename Key::value_type Char;
typedef typename Patricia<K, V>::Children::iterator Iterator;
if((!t) | | key.empty())
return Result();
Key prefix;
std::map<Char, Key> m = t9map::inst().map;
std::queue<boost::tuple<Key, Key, Patricia<K, V>> > q;
q.push(boost::make_tuple(prefix, key, t));
Result res;
while(!q.empty()){
boost::tie(prefix, key, t) = q.front();
q.pop();
for(Iterator it=tchildren.begin(); it!=tchildren.end(); ++it){
Key digits = t9map::inst().toT9(itfirst);
if(is_prefix_of(digits, key)){
if(digits == key)
res.push_back(std::make_pair(prefix+itfirst, itsecondvalue));
else{
key =Key(key.begin()+itfirst.size(), key.end());
q.push(boost::make_tuple(prefix+itfirst, key, itsecond));
}
}
}
}
return res;
180CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
}
The program is similar to the one with Trie very much. This is a typical
bread-rst search approach. Note that we added a member function to t9()
to convert a English word/phrase back to digit number string. This member
function is implemented as the following.
struct t9map{
//...
std::string to_t9(std::string s){
std::string res;
for(std::string::iterator c=s.begin(); c!=s.end(); ++c){
for(Map::iterator m=map.begin(); m!=map.end(); ++m){
std::string val = msecond;
if(std::find(val.begin(), val.end(), c)!=val.end()){
res.push_back(mfirst);
break;
}
}
} // skip error handling.
return res;
}
The error handling for invalid letters is omitted in order to keep the code
short for easy understanding. We can use the very similar test cases as above
except we need change the Trie to Patrica. It will output as below.
test t9 lookup in Patricia
search 4 []
serach 46 [(go, ), (ho, ), ]
serach 466 []
serach 4663 [(gone, ), (good, ), ]
serach 2 [(a, ), ]
serach 22 []
The result is slightly dierent, please refer to the Haskell section for the
reason of this dierence. It is very easy to modify the program to output the
very same result as Tries one.
Recursive algorithm of T9 looking up
T9 implemented in Haskell
In Haskell, we rst dene a map from key pad to English letter. When user
input a key pad number sequence, we take each number and check from the
Trie. All children match the number should be investigated. Below is a Haskell
program to realize T9 input.
mapT9 = [(2, "abc"), (3, "def"), (4, "ghi"), (5, "jkl"),
(6, "mno"), (7, "pqrs"), (8, "tuv"), (9, "wxyz")]
lookupT9 :: Char [(Char, b)] [(Char, b)]
lookupT9 c children = case lookup c mapT9 of
Nothing []
Just s foldl f [] s where
5.7. TRIE AND PATRICIA USED IN INDUSTRY 181
f lst x = case lookup x children of
Nothing lst
Just t (x, t):lst
-- T9-find in Trie
findT9:: Trie a String [(String, Maybe a)]
findT9 t [] = [("", Trie.value t)]
findT9 t (k:ks) = foldl f [] (lookupT9 k (children t))
where
f lst (c, tr) = (mapAppend c (findT9 tr ks)) ++ lst
ndT9 is the main function, it takes 2 parameters, a Trie and a number
sequence string. In non-trivial case, it calls lookupT9 function to examine all
children which match the rst number.
For each matched child, the program recursively calls ndT9 on it with the
left numbers, and we use mapAppend to insert the currently nding letter in
front of all results. The program use foldl to combine all these together.
Function lookupT9 is used to ltered all possible children who match a
number. It rst call lookup function on mapT9, so that a string of possible
English letters can be identied. Next we call lookup for each candidate letter
to see if there is a child can match the letter. We use foldl to collect all such
child together.
This program can be veried by using some simple test cases.
testFindT9 = "press 4: " ++ (show $ take 5 $ findT9 t "4")++
"npress 46: " ++ (show $ take 5 $ findT9 t "46")++
"npress 4663: " ++ (show $ take 5 $ findT9 t "4663")++
"npress 2: " ++ (show $ take 5 $ findT9 t "2")++
"npress 22: " ++ (show $ take 5 $ findT9 t "22")
where
t = Trie.fromList lst
lst = [("home", 1), ("good", 2), ("gone", 3), ("hood", 4),
("a", 5), ("another", 6), ("an", 7)]
The program will output below result.
press 4: [("g",Nothing),("h",Nothing)]
press 46: [("go",Nothing),("ho",Nothing)]
press 4663: [("gone",Just 3),("good",Just 2),("home",Just 1),("hood",Just 4)]
press 2: [("a",Just 5)]
press 22: []
The value of each child is just for illustration, we can put empty value instead
and only returns candidate keys for a real input application.
Tries consumes to many spaces, we can provide a Patricia version as alter-
native.
findPrefixT9 :: String [(String, b)] [(String, b)]
findPrefixT9 s lst = filter f lst where
f (k, _) = (toT9 k) Data.List.isPrefixOf s
toT9 :: String String
toT9 [] = []
toT9 (x:xs) = (unmapT9 x mapT9):(toT9 xs) where
182CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
unmapT9 x (p:ps) = if x elem (snd p) then (fst p) else unmapT9 x ps
findT9 :: Patricia a String [(String, Maybe a)]
findT9 t [] = [("", value t)]
findT9 t k = foldl f [] (findPrefixT9 k (children t))
where
f lst (s, tr) = (mapAppend s (findT9 tr (k diff s))) ++ lst
diff x y = drop (length y) x
In this program, we dont check one digit at a time, we take all the digit
sequence, and we examine all children of the Patricia node. For each child, the
program convert the prex string to number sequence by using function toT9, if
the result is prex of what user input, we go on search in this child and append
the prex in front of all further results.
If we tries the same test case, we can nd the result is a bit dierent.
press 4: []
press 46: [("go",Nothing),("ho",Nothing)]
press 466: []
press 4663: [("good",Just 2),("gone",Just 3),("home",Just 1),("hood",Just 4)]
press 2: [("a",Just 5)]
press 22: []
If user press key 4, because the dictionary (represent by Patricia) doesnt
contain any candidates matches it, user will get an empty candidates list. The
same situation happens when he enters 466. In real input method implemen-
tation, such user experience isnt good, because it displays nothing although
user presses the key several times. One improvement is to predict what user
will input next by display a partial result. This can be easily achieved by modify
the above program. (Hint: not only check
findPrefixT9 s lst = filter f lst where
f (k, _) = (toT9 k) Data.List.isPrefixOf s
but also check
f (k, _) = s Data.List.isPrefixOf (toT9 k)
)
T9 implemented in Scheme/Lisp
In Scheme/Lisp, T9 map is dened as a list of pairs.
(define map-T9 (list ("2" "abc") ("3" "def") ("4" "ghi") ("5" "jkl")
("6" "mno") ("7" "pqrs") ("8" "tuv") ("9" "wxyz")))
The main searching function is implemented as the following.
(define (find-T9 t k) ;; return [(key value)]
(define (accumulate-find lst child)
(append (map-string-append (key child) (find-T9 (tree child) (string-cdr k)))
lst))
(define (lookup-child lst c) ;; lst, list of childen [(key tree)], c, char
(let ((res (find-matching-item map-T9 (lambda (x) (string=? c (car x))))))
(if (not res) ()
5.7. TRIE AND PATRICIA USED IN INDUSTRY 183
(filter (lambda (x) (substring? (key x) (cadr res))) lst))))
(if (string-null? k) (list (cons k (value t)))
(fold-left accumulate-find () (lookup-child (children t) (string-car k)))))
This function contains 2 inner functions. If the string is empty, the program
returns a one element list. The element is a string value pair. For the none
trivial case, the program will call inner function to nd in each child and then
put them together by using fold-left high order function.
To test this T9 search function, a very simple dictionary is established by
using Trie insertion. Then we test by calling nd-T9 function on several digits
sequences.
(define dict-T9 (list ("home" ()) ("good" ()) ("gone" ()) ("hood" ())
("a" ()) ("another" ()) ("an" ())))
(define (test-trie-T9)
(define t (listtrie dict-T9))
(display "find4:") (display (find-T9 t "4")) (newline)
(display "find46:") (display (find-T9 t "46")) (newline)
(display "find4663:") (display (find-T9 t "4663")) (newline)
(display "find2:") (display (find-T9 t "2")) (newline)
(display "find22:") (display (find-T9 t "22")) (newline))
Evaluate this test function will output below result.
find 4: ((g) (h))
find 46: ((go) (ho))
find 4663: ((gone) (good) (hood) (home))
find 2: ((a))
find 22: ()
In order to be more space eective, Patricia can be used to replace Trie. The
search program modied as the following.
(define (find-T9 t k)
(define (accumulate-find lst child)
(append (map-string-append (key child) (find-T9 (tree child) (string- k (key child))))
lst))
(define (lookup-child lst k)
(filter (lambda (child) (string-prefix? (strt9 (key child)) k)) lst))
(if (string-null? k) (list (cons "" (value t)))
(fold-left accumulate-find () (lookup-child (children t) k))))
In this program a string helper function string- is dened to get the dierent
part of two strings. It is dened like below.
(define (string- x y)
(string-tail x (string-length y)))
Another function is str-t9 it will convert a alphabetic string back to digit
sequence base on T9 mapping.
(define (strt9 s)
(define (unmap-t9 c)
(car (find-matching-item map-T9 (lambda (x) (substring? c (cadr x))))))
(if (string-null? s) ""
(string-append (unmap-t9 (string-car s)) (strt9 (string-cdr s)))))
184CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
We can feed the almost same test cases, and the result is output as the
following.
find 4: ()
find 46: ((go) (ho))
find 466: ()
find 4663: ((good) (gone) (home) (hood))
find 2: ((a))
find 22: ()
Note the result is a bit dierent, the reason is described in Haskell section.
It is easy to modify the program, so that Trie and Patricia approach give the
very same result.
5.8 Short summary
In this post, we start from the Integer base trie and Patricia, the map data
structure based on integer patricia plays an important role in Compiler im-
plementation. Next, alphabetic Trie and Patricia are given, and I provide a
example implementation to illustrate how to realize a predictive e-dictionary
and a T9 input method. Although they are far from the real implementation in
commercial software. They show a very simple approach of manipulating text.
There are still some interesting problem can not be solved by Trie or Patrica
directly, how ever, some other data structures such as sux tree have close
relationship with them. Ill note something about sux tree in other post.
5.9 Appendix
All programs provided along with this article are free for downloading.
5.9.1 Prerequisite software
GNU Make is used for easy build some of the program. For C++ and ANSI C
programs, GNU GCC and G++ 3.4.4 are used. I use boost triple to reduce the
amount of our code lines, boost library version I am using is 1.33.1. The path
is in CXX variable in Makele, please change it to your path when compiling.
For Haskell programs GHC 6.10.4 is used for building. For Python programs,
Python 2.5 is used for testing, for Scheme/Lisp program, MIT Scheme 14.9 is
used.
all source les are put in one folder. Invoke make or make all will build
C++ and Haskell program.
Run make Haskell will separate build Haskell program. There will be two
executable le generated one is htest the other is happ (with .exe in Window
like OS). Run htest will test functions in IntTrie.hs, IntPatricia.hs, Trie.hs and
Patricia.hs. Run happ will execute the editionary and T9 test cases in EDict.hs.
Run make cpp will build c++ program. It will create a executable le
named cpptest (with .exe in Windows like OS). Run this program will test
inttrie.hpp, intpatricia.hpp, trie.hpp, patricia.hpp, and edict.hpp.
5.9. APPENDIX 185
Run make c will build the ANSI C program for Trie. It will create a
executable le named triec (with .exe in Windows like OS).
Python programs can run directly with interpreter.
Scheme/Lisp program need be loaded into Scheme evaluator and evaluate the
nal function in the program. Note that patricia.scm will hide some functions
dened in trie.scm.
Here is a detailed list of source les
5.9.2 Haskell source les
IntTrie.hs, Haskell version of little-endian integer Trie.
IntPatricia.hs, integer Patricia tree implemented in Haskell.
Trie.hs, Alphabetic Trie, implemented in Haskell.
Patricia.hs, Alphabetic Patricia, implemented in Haskell.
TestMain.hs, main module to test the above 4 programs.
EDict.hs, Haskell program for e-dictionary and T9.
5.9.3 C++/C source les
inttrie.hpp, Integer base Trie;
intpatricia.hpp, Integer based Patricia tree;
trie.c, Alphabetic Trie only for lowercase English language, implemented
in ANSI C.
trie.hpp, Alphabetic Trie;
patricia.hpp, Alphabetic Patricia;
trieutil.hpp, Some generic utilities;
edit.hpp, e-dictionary and T9 implemented in C++;
test.cpp, main program to test all above programs.
5.9.4 Python source les
inttrie.py, Python version of little-endian integer Trie, with test cases;
intpatricia.py, integer Patricia tree implemented in Python;
trie.py, Alphabetic Trie, implemented in Python;
patricia.py, Alphabetic Patricia implemented in Python;
trieutil.py, Common utilities;
edict.py, e-dictionary and T9 implemented in Python.
186CHAPTER 5. TRIE AND PATRICIA WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
5.9.5 Scheme/Lisp source les
inttrie.scm, Little-endian integer Trie, implemented in Scheme/Lisp;
intpatricia.scm, Integer based Patricia tree;
trie.scm, Alphabetic Trie;
patricia.scm, Alphabetic Patricia, reused many denitions in Trie;
trieutil.scm, common functions and utilities.
5.9.6 Tools
Besides them, I use graphviz to draw most of the gures in this post. In order
to translate the Trie, Patrica and Sux Tree output to dot language scripts. I
wrote a python program. it can be used like this.
trie2dot.py -o foo.dot -t patricia "1:x, 4:y, 5:z"
trie2dot.py -o foo.dot -t trie "001:one, 101:five, 100:four"
This helper scripts can also be downloaded with this article.
download position: http://sites.google.com/site/algoxy/trie/trie.zip
Bibliography
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Cliord
Stein. Introduction to Algorithms, Second Edition. ISBN:0262032937.
The MIT Press. 2001
[2] Chris Okasaki and Andrew Gill. Fast Mergeable Integer Maps. Workshop
on ML, September 1998, pages 77-86, http://www.cse.ogi.edu/ andy/pub-
/nite.htm
[3] D.R. Morrison, PATRICIA Practical Algorithm To Retrieve Information
Coded In Alphanumeric, Journal of the ACM, 15(4), October 1968, pages
514-534.
[4] Sux Tree, Wikipedia. http://en.wikipedia.org/wiki/Sux tree
[5] Trie, Wikipedia. http://en.wikipedia.org/wiki/Trie
[6] T9 (predictive text), Wikipedia. http://en.wikipedia.org/wiki/T9 (predictive text)
[7] Predictive text, Wikipedia. http://en.wikipedia.org/wiki/Predictive text
[8] Bjorn Karlsson. Beyond the C++ Standard Library: An Introduction to
Boost. Addison Wesley Professional, August 31, 2005, ISBN: 0321133544
Sux Tree with Functional and imperative implementation
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
187
188 Sux Tree
Chapter 6
Sux Tree with Functional
and imperative
implementation
6.1 abstract
Sux Tree is an important data structure. It is quite powerful in string and
DNA information manipulations. Sux Tree is introduced in 1973. The lat-
est on-line construction algorithm was found in 1995. This post collects some
existing result of sux tree, including the construction algorithms as well as
some typical applications. Some imperative and functional implementation
are given. There are multiple programming languages used, including C++,
Haskell, Python and Scheme/Lisp.
There may be mistakes in the post, please feel free to point out.
This post is generated by L
A
T
E
X2
of X in SuffixTrie(S
i
) must also have s
i+1
-child.
In other words, let c = s
i+1
, if wc is a sub-string of S
i
, then every sux of wc
is also a sub-string of S
i
[1]. The only exception is root node, which represents
for empty string .
According to this fact, we can rene the algorithm 1 to
Algorithm 2 Revised version of update SuffixTrie(S
i
) to SuffixTrie(S
i+1
).
1: for each node in SuffixTrie(S
i
) in descending order of sux length do
2: if CHILDREN(node)[s
i+1
] = NIL then
3: CHILDREN(node)[s
i+1
] CREATE NEW NODE()
4: else
5: break
The next unclear question is how to iterate all nodes in SuffixTrie(S
i
) in
descending order of sux string length? We can dene the top of a sux Trie as
6.3. SUFFIX TRIE 195
the deepest leaf node, by using sux link for each node, we can traverse sux
Trie until the root. Note that the top of SuffixTrie(NIL) is root, so we can
get a nal version of on-line construction algorithm for sux Trie.
function INSERT(top, c)
if top = NIL then
top CREATE NEW NODE()
node top
node
) CHILDREN(node)[c]
node
CHILDREN(node)[c]
node SUFFIX LINK(node)
if node = NIL then
SUFFIX LINK(node
) CHILDREN(node)[c]
return CHILDREN(top)[c]
The above function INSERT(), can update SuffixTrie(S
i
) to SuffixTrie(S
i+1
).
It receives two parameters, one is the top node of SuffixTrie(S
i
), the other
is the character of s
i+1
. If the top node is NIL, which means that there is no
root node yet, it create the root node then. Compare to the algorithm given by
Ukkonen [1], I use a dummy node node
)
represents to a position which is not an explicit node. By using reference pair,
we can represent every position in a sux Trie for sux tree.
In order to save spaces, Ukkonen found that given a string S all sub-strings
202CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
can be represented as a pair of index (l, r), where l is the left index and r is the
right index of character for the sub-string. For instance, if S = bananas
, and
the index from 1, sub-string na can be represented with pair (3, 4). As the
result, there will be only one copy of the complete string, and all position in a
sux tree will be rened as (node, (l, r)). This is the nal form for reference
pair.
Lets dene the node transfer for sux tree as the following.
CHILDREN(X)[s
l
] ((l, r), Y ) Y (X, (l, r))
If s
i
= c, we say that node X has a c-child. Each node can have at most one
c-child.
canonical reference pair
Its obvious that the one position in a sux tree has multiple reference pairs.
For example, node Y in Figure 6.7 can be either denoted as (X, (3, 4)) or
(root, (2, 4)). And if we dene empty string = (i, i 1), Y can also be repre-
sented as (Y, ).
Ukkonen dened the canonical reference pair as the one which has the clos-
est node to the position. So among the reference pairs of (root, (2, 3)) and
(X, (3, 3)), the latter is the canonical reference pair. Specially, in case a posi-
tion is an explicit node, the canonical reference pair is (node, ), so (Y, ) is the
canonical reference pair of position corresponding to node Y .
Its easy to provide an algorithm to convert a reference pair (node, (l, r)) to
canonical reference pair (node
, (l
, l
) as the result.
Algorithm 3 Convert reference pair to canonical reference pair
1: function CANONIZE(node, (l, r))
2: if node = NIL then
3: if (l, r) = then
4: return (NIL, l)
5: else
6: return CANONIZE(root, (l + 1, r))
7: while l r do
8: ((l
, r
), node
) CHILDREN(node)[s
l
]
9: if r l r
then
10: l l +LENGTH(l
, r
)
11: node node
12: else
13: break
14: return (node, l)
In case the node parameter is NIL, it means a very special case, typically
it is something like the following.
CANONIZE(SUFFIX LINK(root), (l, r))
Because the sux link of root points to NIL, the result should be (root, (l +
1, r)) if (l, r) is not . Else (NIL, ) is returned to indicate a terminal position.
Ill explain this special case in more detail later.
6.4. SUFFIX TREE 203
The algorithm
In 6.4.1, we mentioned, all updating to leaves is trivial, because we only need
append the new coming character to the leaf. With reference pair, it means,
when we update SuffixTree(S
i
) to SuffixTree(S
i+1
), For all reference pairs
with form (node, (l, i)), they are leaves, they will be change to (node, (l, i + 1))
next time. Ukkonen dened leaf as (node, (l, )), here means open to
grow. We can omit all leaves until the sux tree is completely constructed.
After that, we can change all to the length of the string.
So the main algorithm only cares about positions from active point to end
point. However, how to nd the active point and end point?
When we start from the very beginning, there is only a root node, there are
no branches nor leaves. The active point should be (root, ), or (root, (1, 0))
(the string index start from 1).
About the end point, its a position we can nish updating SuffixTree(S
i
).
According to the algorithm for sux Trie, we know it should be a position which
has s
i+1
-child already. Because a position in sux Trie may not be an explicit
node in sux tree. If (node, (l, r)) is the end point, there are two cases.
1. (l, r) = , it means node itself an end point, so node has a s
i+1
-child.
Which means CHILDREN(node)[s
i+1
] = NIL
2. otherwise, l r, end point is an implicit position. It must satisfy s
i+1
=
s
l
+|(l,r)|
, where CHILDREN(node)[s
l
] = ((l
, r
), node
)[s
i
] ((i, ), CREATENEW NODE())
8: SUFFIX LINK(prev) node
9: prev node
, r
), node
) CHILDREN(node)[s
l
]
pos l
+|(l, r)|
if s
pos
= c then
return (TRUE, node)
else
p CREATE NEW NODE()
CHILDREN(node)[s
l
] ((l
, pos 1), p)
CHILDREN(p)[s
pos
] ((pos, r
), node
)
return (FALSE, p)
If the position is (root, ), which means we have gone along sux links to the
root, we return TRUE to indicate the updating can be nished for this round.
If the position is in form of (node, ), it means the reference pair represents an
explicit node, we just test if this node has already c-child, where c = s
i
. and if
not, we can just branching out a leaf from this node.
In other case, which means the position (node, (l, r)) points to a implicit
node. We need nd the exact position next to it to see if it is c-child implicitly.
If yes, we meet a end point, the updating loop can be nished, else, we make
the position an explicit node, and return it for further branching.
With the previous dened CANONIZE() function, we can nalize the
Ukkonens algorithm.
1: function SUFFIX-TREE(S)
2: root CREATE NEW NODE()
3: node root, l 0
4: for i 1 to LENGTH(S) do
5: (node, l) = UPDATE(node, (l, i))
6: (node, l) = CANONIZE(node, (l, i))
7: return root
Figure 6.10 shows the phases when constructing the sux tree for string
cacao with Ukkonens algorithm.
Note that it neednt setup sux link for leaf nodes, only branch nodes have
been set sux links.
Implementation of Ukkonens algorithm in imperative languages
The 2 main features of Ukkonens algorithm are intense use of sux link and
on-line update. So it will be very suitable to implement in imperative language.
Ukkonens algorithm in Python
The node denition is as same as the sux Trie, however, the exact meaning
for children eld are not same.
class Node:
def __init__(self, suffix=None):
self.children = {} # c:(word, Node), where word = (l, r)
206CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
root
empty
c
c
a ca
ca
ac cac
cac
aca caca
caca
a ca o
cao o cao o
cacao
Figure 6.10: Construction of sux tree for cacao, the 6 phases are shown,
only the last layer of sux links are shown in dotted arrow.
self.suffix = suffix
The children for sux tree actually represent to the node transition with
reference pair. if the transition is CHILDREN(node)[s
l
] ((l, r)node
), The
key type of the children is the character, which is corresponding to s
l
, the data;
the data type of the children is the reference pair.
Because there is only one copy of the complete string, all sub-strings are
represent in (left, right) pairs, and the leaf are open pairs as (left, ), so we
provide a tree denition in Python as below.
class STree:
def __init__(self, s):
self.str = s
self.infinity = len(s)+1000
self.root = Node()
The innity is dened as the length of the string plus a big number, well
benet from pythons list[a:b] expression that if the right index exceed to the
length of the list, the result is from left to the end of the list.
For convenience, I provide 2 helper functions for later use.
def substr(str, str_ref):
(l, r)=str_ref
return str[l:r+1]
def length(str_ref):
(l, r)=str_ref
return r-l+1
The main entry for Ukkonens algorithm is implemented as the following.
def suffix_tree(str):
t = STree(str)
node = t.root # init active point is (root, Empty)
6.4. SUFFIX TREE 207
l = 0
for i in range(len(str)):
(node, l) = update(t, node, (l, i))
(node, l) = canonize(t, node, (l, i))
return t
In the main entry, we initialize the tree and let the node points to the root,
at this time point, the active point is (root, ), which is (root, (0, -1)) in Python.
we pass the active point to update() function in a loop from the left most index
to the right most index of the string. Inside the loop, update() function returns
the end point, and we need convert it to canonical reference pair for the next
time update.
the update() function is realized like the following.
def update(t, node, str_ref):
(l, i) = str_ref
c = t.str[i] # current char
prev = Node() # dummy init
while True:
(finish, p) = branch(t, node, (l, i-1), c)
if finish:
break
p.children[c]=((i, t.infinity), Node())
prev.suffix = p
prev = p
# go up along suffix link
(node, l) = canonize(t, node.suffix, (l, i-1))
prev.suffix = node
return (node, l)
Dierent with Ukkonens original program, I didnt use sentinel node. The
reference passed in is (node, (l, i), the active point is (node, (l, i1)) actually, we
passed the active point to branch() function. If it is end point, branch() function
will return true as the rst element in the result. we then terminate the loop
immediately. Otherwise, branch() function will return the node which need to
branch out a new leaf as the second element in the result. The program then
create the new leaf, set it as open pair, and then go up along with sux link.
The prev variable rst point to a dummy node, this can simplify the logic, and
it used to record the position along the boundary path. by the end of the loop,
well nish the last updating of the sux link and return the end point. Since
the end point is always in form of (node, (l, i-1)), only (node, l) is returned.
Function branch() is used to test if a position is the end point and turn the
implicit node to explicit node if necessary.
def branch(t, node, str_ref, c):
(l, r) = str_ref
if length(str_ref)0: # (node, empty)
if node is None: #_| _
return (True, t.root)
else:
return ((c in node.children), node)
else:
((l1, r1), node1) = node.children[t.str[l]]
pos = l1+length(str_ref)
208CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
if t.str[pos]==c:
return (True, node)
else: # node--branch_node--node1
branch_node = Node()
node.children[t.str[l1]]=((l1, pos-1), branch_node)
branch_node.children[t.str[pos]] = ((pos, r1), node1)
return (False, branch_node)
Because I dont use sentinel node, the special case is handled in the rst
if-clause.
The canonize() function helps to convert a reference pair to canonical refer-
ence pair.
def canonize(t, node, str_ref):
(l, r) = str_ref
if node is None:
if length(str_ref)0:
return (None, l)
else:
return canonize(t, t.root, (l+1, r))
while lr: # str_ref is not empty
((l1, r1), child) = node.children[t.str[l]] # node--(l, r)-child
if r-l r1-l1: #node--(l,r)-child--...
l += r1-l1+1 # remove | (l,r)| chars from (l, r)
node = child
else:
break
return (node, l)
Before testing the sux tree construction algorithm, some helper functions
to convert the sux tree to human readable string are given.
def to_lines(t, node):
if len(node.children)==0:
return [""]
res = []
for c, (str_ref, tr) in sorted(node.children.items()):
lines = to_lines(t, tr)
edge_str = substr(t.str, str_ref)
lines[0] = "| --"+edge_str+"-"+lines[0]
if len(node.children)>1:
lines[1:] = map(lambda l: "| "+""(len(edge_str)+5)+l, lines[1:])
else:
lines[1:] = map(lambda l: ""+""(len(edge_str)+6)+l, lines[1:])
if res !=[]:
res.append("| ")
res += lines
return res
def to_str(t):
return "n".join(to_lines(t, t.root))
They are quite similar to the helper functions for sux Trie print. The
dierent part is mainly cause by the reference pair of string.
In order to verify the implementation, some very simple test cases are feed
to the algorithm as below.
6.4. SUFFIX TREE 209
class SuffixTreeTest:
def __init__(self):
print "startsuffixtreetest"
def run(self):
strs = ["cacao", "mississippi", "banana$"] #$ special terminator
for s in strs:
self.test_build(s)
def test_build(self, str):
for i in range(len(str)):
self.__test_build(str[:i+1])
def __test_build(self, str):
print "SuffixTree("+str+"):n", to_str(suffix_tree(str)),"n"
Here is a result snippet which shows the construction phases for string ca-
cao
Suffix Tree (c):
|--c-->
Suffix Tree (ca):
|--a-->
|
|--ca-->
Suffix Tree (cac):
|--ac-->
|
|--cac-->
Suffix Tree (caca):
|--aca-->
|
|--caca-->
Suffix Tree (cacao):
|--a-->|--cao-->
| |
| |--o-->
|
|--ca-->|--cao-->
| |
| |--o-->
|
|--o-->
The result is identical to the one which is shown in gure 6.10.
210CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Ukkonens algorithm in C++
Ukkonens algorithm has much use of pairs, including reference pair and sub-
string index pair. Although STL provided std::pair tool, it is lack of variable
binding ability, for example (x, y)=apair like assignment isnt legal C++ code.
boost::tuple provides a handy tool, tie(). Ill give a mimic tool like boost::tie so
that we can bind two variables to a pair.
template<typename T1, typename T2>
struct Bind{
Bind(T1& r1, T2& r2):x1(r1), x2(r2){}
Bind(const Bind& r):x1(r.x1), x2(r.x2){}
// Support implicit type conversion
template<typename U1, typename U2>
Bind& operator=(const std::pair<U1, U2>& p){
x1 = p.first;
x2 = p.second;
return this;
}
T1& x1;
T2& x2;
};
template<typename T1, typename T2>
Bind<T1, T2> tie(T1& r1, T2& r2){
return Bind<T1, T2>(r1, r2);
}
With this tool, we can tie variables like the following.
int l, r;
tie(l, r) = str_pair;
We dene sub-string index pair and reference pair like below. First is string
index reference pair.
struct StrRef: public std::pair<int, int>{
typedef std::pair<int, int> Pair;
static std::string str;
StrRef():Pair(){}
StrRef(int l, int r):Pair(l, r){}
StrRef(const Pair& ref):Pair(ref){}
std::string substr(){
int l, r;
tie(l, r) = this;
return str.substr(l, len());
}
int len(){
int l, r;
tie(l, r) = this;
return r-l+1;
}
};
6.4. SUFFIX TREE 211
std::string StrRef::str="";
Because there is only one copy of the complete string, a static variable is
used to store it. substr() function is used to convert a pair of left, right index
into the sub-string. Function len() is used to calculate the length of a sub-string.
Ukkonens reference pair is dened in the same way.
struct Node;
struct RefPair: public std::pair<Node, StrRef>{
typedef std::pair<Node, StrRef> Pair;
RefPair():Pair(){}
RefPair(Node n, StrRef s):Pair(n, s){}
RefPair(const Pair& p):Pair(p){}
Node node(){ return first; }
StrRef str(){ return second; }
};
With these denition, the node type of sux tree can be dened.
struct Node{
typedef std::string::value_type Key;
typedef std::map<Key, RefPair> Children;
Node():suffix(0){}
~Node(){
for(Children::iterator it=children.begin();
it!=children.end(); ++it)
delete itsecond.node();
}
Children children;
Node suffix;
};
The children of a node is dened as a map with reference pairs stored. In
order for easy memory management, a recursive approach is used.
The nal sux tree is dened with a string and a root node.
struct STree{
STree(std::string s):str(s),
infinity(s.length()+1000),
root(new Node)
{ StrRef::str = str; }
~STree() { delete root; }
std::string str;
int infinity;
Node root;
};
the innity is dened as the length of the string plus a big number. Innity
will be used for leaf node as the open to append meaning.
Next is the main entry of Ukkonens algorithm.
STree suffix_tree(std::string s){
212CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
STree t=new STree(s);
Node node = troot; // init active point as (root, empty)
for(unsigned int i=0, l=0; i<s.length(); ++i){
tie(node, l) = update(t, node, StrRef(l, i));
tie(node, l) = canonize(t, node, StrRef(l, i));
}
return t;
}
The program start from the initialized active point, and repeatedly call up-
date(), the returned end point will be canonized and used for the next active
point.
Function update() is implemented as below.
std::pair<Node, int> update(STree t, Node node, StrRef str){
int l, i;
tie(l, i)=str;
Node::Key c(tstr[i]); //current char
Node dummy, p;
Node prev(&dummy);
while((p=branch(t, node, StrRef(l, i-1), c))!=0){
pchildren[c]=RefPair(new Node(), StrRef(i, tinfinity));
prevsuffix = p;
prev = p;
// go up along suffix link
tie(node, l) = canonize(t, nodesuffix, StrRef(l, i-1));
}
prevsuffix = node;
return std::make_pair(node, l);
}
In this function, pair (node, (l, i-1)) is the real active point position. It is
fed to branch() function. If the position is end point, branch will return NULL
pointer, so the while loop terminates. else a node for branching out a new leaf
is returned. Then the program will go up along with sux links and update the
previous sux link accordingly. The end point will be returned as the result of
this function.
Function branch() is implemented as the following.
Node branch(STree t, Node node, StrRef str, Node::Key c){
int l, r;
tie(l, r) = str;
if(str.len()0){ // (node, empty)
if(node && nodechildren.find(c)==nodechildren.end())
return node;
else
return 0; // either node is empty (_| _), or is EP
}
else{
RefPair rp = nodechildren[tstr[l]];
int l1, r1;
tie(l1, r1) = rp.str();
int pos = l1+str.len();
if(tstr[pos]==c)
return 0;
6.4. SUFFIX TREE 213
else{ // node--branch_node--node1
Node branch_node = new Node();
nodechildren[tstr[l1]]=RefPair(branch_node, StrRef(l1, pos-1));
branch_nodechildren[tstr[pos]] = RefPair(rp.node(), StrRef(pos, r1));
return branch_node;
}
}
}
If the position is (NULL, empty), it means the program arrive at the root
position, NULL pointer is returned to indicate the updating can be terminated.
If the position is in form of (node, ), it then check if the node has already a
s
i
-child. In other case, it means the position point to a implicit node, we need
extra process to test if it is end point. If not, a splitting happens to convert the
implicit node to explicit one.
The function to canonize a reference pair is given below.
std::pair<Node, int> canonize(STree t, Node node, StrRef str){
int l, r;
tie(l, r)=str;
if(!node)
if(str.len()0)
return std::make_pair(node, l);
else
return canonize(t, troot, StrRef(l+1, r));
while(lr){ //str isntempty
RefPairrp=nodechildren[tstr[l]];
intl1,r1;
tie(l1,r1)=rp.str();
if(r-lr1-l1){
l+=rp.str().len();//removelen()from(l,r)
node=rp.node();
}
else
break;
}
returnstd::make_pair(node,l);
}
In order to test the program, some helper functions are provided to represent
the sux tree as string. Among them, some are very common tools.
// map (x+) coll in Haskell
// boost lambda: transform(first, last, first, x+_1)
template<class Iter, class T>
void map_add(Iter first, Iter last, T x){
std::transform(first, last, first,
std::bind1st(std::plus<T>(), x));
}
// x ++ y in Haskell
template<class Coll>
void concat(Coll& x, Coll& y){
std::copy(y.begin(), y.end(),
std::insert_iterator<Coll>(x, x.end()));
}
214CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
map add() will add a value to every element in a collection. concat can
concatenate tow collections together.
the sux tree to string function is nally provided like this.
std::list<std::string> to_lines(Node node){
typedef std::list<std::string> Result;
Result res;
if(nodechildren.empty()){
res.push_back("");
return res;
}
for(Node::Children::iterator it = nodechildren.begin();
it!=nodechildren.end(); ++it){
RefPair rp = itsecond;
Result lns = to_lines(rp.node());
std::string edge = rp.str().substr();
lns.begin() = "| --" + edge + "-" + (lns.begin());
map_add(++lns.begin(), lns.end(),
std::string("| ")+std::string(edge.length()+5, ));
if(!res.empty())
res.push_back("| ");
concat(res, lns);
}
return res;
}
std::string to_str(STree t){
std::list<std::string> ls = to_lines(troot);
std::ostringstream s;
std::copy(ls.begin(), ls.end(),
std::ostream_iterator<std::string>(s, "n"));
return s.str();
}
After that, the program can be veried by some simple test cases.
class SuffixTreeTest{
public:
SuffixTreeTest(){
std::cout<<"Startsuffixtreetestn";
}
void run(){
test_build("cacao");
test_build("mississippi");
test_build("banana$"); //$ as special terminator
}
private:
void test_build(std::string str){
for(unsigned int i=0; i<str.length(); ++i)
test_build_step(str.substr(0, i+1));
}
void test_build_step(std::string str){
STree t = suffix_tree(str);
std::cout<<"SuffixTree("<<str<<"):n"
6.4. SUFFIX TREE 215
<<to_str(t)<<"n";
delete t;
}
};
Below are snippet of sux tree construction phases.
Suffix Tree (b):
|--b-->
Suffix Tree (ba):
|--a-->
|
|--ba-->
Suffix Tree (ban):
|--an-->
|
|--ban-->
|
|--n-->
Suffix Tree (bana):
|--ana-->
|
|--bana-->
|
|--na-->
Suffix Tree (banan):
|--anan-->
|
|--banan-->
|
|--nan-->
Suffix Tree (banana):
|--anana-->
|
|--banana-->
|
|--nana-->
Suffix Tree (banana$):
|--$-->
|
|--a-->|--$-->
| |
| |--na-->|--$-->
| | |
| | |--na$-->
216CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
|
|--banana$-->
|
|--na-->|--$-->
| |
| |--na$-->
Functional algorithm for sux tree construction
Ukkonens algorithm is in a manner of on-ling updating and sux link plays
very important role. Such properties cant be realized in a functional approach.
Giegerich and Kurtz found Ukkonens algorithm can be transformed to Mc-
Creights algorithm[7]. These two and the algorithm found by Weiner are all
O(n)-algorithms. They also conjecture (although it isnt proved) that any se-
quential sux tree construction not base on the important concepts, such as
sux links, active suxes, etc., will fail to meet O(n)-criterion.
There is implemented in PLT/Scheme[10] based on Ukkonens algorithm,
However it updates sux links during the processing, so it is not pure functional
approach.
A lazy sux tree construction method is discussed in [8]. And this method
is contribute to Haskell Hackage by Bryan OSullivan. [9].
This method benet from the lazy evaluation property of Haskell program-
ming languages, so that the tree wont be constructed until it is traversed.
However, I think it is still a kind of brute-force method. In other functional
programming languages such as ML. It cant be O(n) algorithm.
I will provide a pure brute-force implementation which is similar but not
100% same in this post.
Brute-force sux tree construction in Haskell
For brute-force implementation, we neednt sux link at all. The denition of
sux node is plain straightforward.
data Tr = Lf | Br [(String, Tr)] deriving (Eq, Show)
type EdgeFunc = [String](String, [String])
The edge function plays interesting role. it takes a list of strings, and extract
a prex of these strings, the prex may not be the longest prex, and empty
string is also possible. Whether extract the longest prex or just return them
trivially can be dened by dierent edge functions.
For easy implementation, we limit the character set as below.
alpha = [a..z]++[A..Z]
This is only for illustration purpose, only the English lower case and upper
case letters are included. We can of course includes other characters if necessary.
The core algorithm is given in list comprehension style.
lazyTree::EdgeFunc [String] Tr
lazyTree edge = build where
build [[]] = Lf
build ss = Br [(a:prefix, build ss) |
aalpha,
xs@(x:_) [[cs | c:csss, c==a]],
6.4. SUFFIX TREE 217
(prefix, ss)[edge xs]]
lazyTree function takes a list of string, it will generate a radix tree (for
example a Trie or a Patricia) from these string.
It will categorize all strings with the rst letter in several groups, and remove
the rst letter for each elements in every group. For example, for the string list
[acac, cac, ac, c] the categorized group will be [(a, [cac, c]), (c,
[ac, ])]. For easy understanding, I left the rst letter, and write the groups
as tuple. then all strings with same rst letter (removed) will be fed to edge
function.
Dierent edge function produce dierent radix trees. The most trivial one
will build a Trie.
edgeTrie::EdgeFunc
edgeTrie ss = ("", ss)
If the edge function extract the longest common prex, then it will build a
Patricia.
-- ex:
-- edgeTree ["an", "another", "and"] = ("an", ["", "other", "d"])
-- edgeTree ["bool", "foo", "bar"] = ("", ["bool", "foo", "bar"])
--
-- some helper comments
-- let awss@((a:w):ss) = ["an", "another", "and"]
-- (a:w) = "an", ss = ["another", "and"]
-- a=a, w="n"
-- rests awss = w:[u| _:uss] = ["n", "nother", "nd"]
--
edgeTree::EdgeFunc
edgeTree [s] = (s, [[]])
edgeTree awss@((a:w):ss) | null [c| c:_ss, a/=c] = (a:prefix, ss)
| otherwise = ("", awss)
where (prefix, ss) = edgeTree (w:[u| _:uss])
edgeTree ss = ("", ss) -- (a:w):ss cant be match head ss == ""
We can build sux Trie and sux tree with the above two functions.
suffixTrie::StringTr
suffixTrie = lazyTree edgeTrie tails -- or init tails
suffixTree::StringTr
suffixTree = lazyTree edgeTree tails
Below snippet is the result of constructing sux Trie/tree for string mis-
sissippi.
SuffixTree("mississippi")=Br [("i",Br [("ppi",Lf),("ssi",Br [("ppi",Lf),
("ssippi",Lf)])]),("mississippi",Lf),("p",Br [("i",Lf),("pi",Lf)]),("s",
Br [("i",Br [("ppi",Lf),("ssippi",Lf)]),("si",Br [("ppi",Lf),("ssippi",
Lf)])])]
SuffixTrie("mississippi")=Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),
("s",Br [("s",Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),("s",Br [("s",
Br [("i",Br [("p",Br [("p",Br [("i",Lf)])])])])])])])])]),("m",Br [("i",
Br [("s",Br [("s",Br [("i",Br [("s",Br [("s",Br [("i",Br [("p",Br [("p",
Br [("i",Lf)])])])])])])])])])]),("p",Br [("i",Lf),("p",Br [("i",Lf)])]),
218CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
("s",Br [("i",Br [("p",Br [("p",Br [("i",Lf)])]),("s",Br [("s",Br [("i",
Br [("p",Br [("p",Br [("i",Lf)])])])])])]),("s",Br [("i",Br [("p",Br
[("p",Br [("i",Lf)])]),("s",Br [("s",Br [("i",Br [("p",Br [("p",Br
[("i",Lf)])])])])])])])])]
Function lazyTree is common for all radix trees, the normal Patricia and
Trie can also be constructed with it.
trie::[String]Tr
trie = lazyTree edgeTrie
patricia::[String]Tr
patricia = lazyTree edgeTree
Lets test it with some simple cases.
trie ["zoo", "bool", "boy", "another", "an", "a"]
patricia ["zoo", "bool", "boy", "another", "an", "a"]
The results are as below.
Br [("a",Br [("n",Br [("o",Br [("t",Br [("h",Br [("e",
Br [("r",Lf)])])])])])]),("b",Br [("o",Br [("o",Br [("l",Lf)]),
("y",Lf)])]),("z",Br [("o",Br [("o",Lf)])])]
Br [("another",Lf),("bo",Br [("ol",Lf),("y",Lf)]),("zoo",Lf)]
This is reason why I think the method is brute-force.
Brute-force sux tree construction in Scheme/Lisp
.
The Functional implementation in Haskell utilizes list comprehension, which
is a handy syntax tool. In Scheme/Lisp, we use functions instead.
In MIT scheme, there are special functions to manipulate strings, which is
a bit dierent from list. Below are helper functions to simulate car and cdr
function for string.
(define (string-car s)
(if (string=? s "")
""
(string-head s 1)))
(define (string-cdr s)
(if (string=? s "")
""
(string-tail s 1)))
The edge functions will extract common prex from a list of strings. For
Trie, only the rst common character will be extracted.
;; (edge-trie ("an" "another" "and"))
;; = ("a" "n" "nother" "nd")
(define (edge-trie ss)
(cons (string-car (car ss)) (map string-cdr ss)))
While for sux tree, we need extract the longest common prex.
6.4. SUFFIX TREE 219
;; (edge-tree ("an" "another" "and"))
;; = ("an" "" "other" "d")
(define (edge-tree ss)
(cond ((= 1 (length ss)) (cons (car ss) ()))
((prefix? ss)
(let ((res (edge-tree (map string-cdr ss)))
(prefix (car res))
(ss1 (cdr res)))
(cons (string-append (string-car (car ss)) prefix) ss1)))
(else (cons "" ss))))
;; test if a list of strings has common prefix
;; (prefix ("an" "another" "and")) = true
;; (prefix ("" "other" "d")) = false
(define (prefix? ss)
(if (null? ss)
()
(let ((c (string-car (car ss))))
(null? (filter (lambda (s) (not (string=? c (string-car s))))
(cdr ss))))))
For some old version of MIT scheme, there isnt denition for partition
function, so I dened one like below.
;; overwrite the partition if not support SRFI 1
;; (partition (> 5) (1 6 2 7 3 9 0))
;; =((6 7 9) 1 2 3 0)
(define (partition pred lst)
(if (null? lst)
(cons () ())
(let ((res (partition pred (cdr lst))))
(if (pred (car lst))
(cons (cons (car lst) (car res)) (cdr res))
(cons (car res) (cons (car lst) (cdr res)))))))
Function groups can group a list of strings based on the rst letter of each
string.
;; group a list of strings based on first char
;; ss shouldnt contains "" string, so filter should be done first.
;; (groups ("an" "another" "bool" "and" "bar" "c"))
;; = (("an" "another" "and") ("bool" "bar") ("c"))
(define (groups ss)
(if (null? ss)
()
(let ((c (string-car (car ss)))
(res (partition (lambda (x) (string=? c (string-car x))) (cdr ss))))
(append (list (cons (car ss) (car res)))
(groups (cdr res))))))
Function remove-empty can remove the empty string from the string list.
(define (remove-empty ss)
(filter (lambda (s) (not (string=? "" s))) ss))
With all the above tools, the core brute-force algorithm can be implemented
like the following.
220CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
(define (make-tree edge ss)
(define (bld-group grp)
(let ((res (edge grp))
(prefix (car res))
(ss1 (cdr res)))
(cons prefix (make-tree edge ss1))))
(let ((ss1 (remove-empty ss)))
(if (null? ss1) ()
(map bld-group (groups ss1)))))
The nal sux tree and sux Trie construction algorithms can be given.
(define (suffix-tree s)
(make-tree edge-tree (tails s)))
(define (suffix-trie s)
(make-tree edge-trie (tails s)))
Below snippet are quick verication of this program.
(suffix-trie "cacao")
;Value 66: (("c" ("a" ("c" ("a" ("o"))) ("o"))) ("a" ("c" ("a" ("o"))) ("o")) ("o"))
(suffix-tree "cacao")
;Value 67: (("ca" ("cao") ("o")) ("a" ("cao") ("o")) ("o"))
6.5 Sux tree applications
Sux tree can help to solve a lot of string/DNA manipulation problems partic-
ularly fast. For typical problems will be list in this section.
6.5.1 String/Pattern searching
There a plenty of string searching problems, among them includes the famous
KMP algorithm. Sux tree can perform same level as KMP[11], that string
searching in O(m) complexity, where m is the length of the sub-string.However,
O(n) time is required to build the sux tree in advance[12].
Not only sub-string searching, but also pattern matching, including regular
expression matching can be solved with sux tree. Ukkonen summarize this
kind of problems as sub-string motifs, and he gave the result that For a string
S, SuffixTree(S) gives complete occurrence counts of all sub-string motifs of
S in O(n) time, although S may have O(n
2
) sub-strings.
Note the facts of a SuffixTree(S) that all internal nodes is corresponding
to a repeating sub-string of S and the number of leaves of the sub-tree of a node
for string P is the number of occurrence of P in S.[13]
Algorithm of nding the number of sub-string occurrence
The algorithm is almost as same as the Patricia looking up algorithm, please
refer to [5] for detail, the only dierence is that the number of the children is
returned when a node matches the pattern.
6.5. SUFFIX TREE APPLICATIONS 221
Find number of sub-string occurrence in Python
In Ukkonens algorithm, there is only one copy of string, and all edges are
represent with index pairs. There are some changes because of this reason.
def lookup_pattern(t, node, s):
f = (lambda x: 1 if x==0 else x)
while True:
match = False
for _, (str_ref, tr) in node.children.items():
edge = t.substr(str_ref)
if string.find(edge, s)==0: #s isPrefixOf edge
return f(len(tr.children))
elif string.find(s, edge)==0: #edge isPrefixOf s
match = True
node = tr
s = s[len(edge):]
break
if not match:
return 0
return 0 # not found
In case a branch node matches the pattern, it means there is at least one
occurrence even if the number of children is zero. Thats why a local lambda
function is dened.
I added a member function in STree to convert a string index pair to string
as below.
class STree:
#...
def substr(self, sref):
return substr(self.str, sref)
In lookup pattern() function, it takes a sux tree which is built from the
string. A node is passed as the position to be looked up, it is root node when
starting. Parameter s is the string to be searched.
The algorithm iterate all children of the node, it convert the string index
reference pair to edge sub-string, and check if s is prex of the the edge string,
if matches, then the program can be terminated, the number of the branches of
this node will be returned as the number of the occurrence of this sub-string.
Note that no branch means there is only 1 occurrence. In case the edge is
prex of s, we then updated the node and string to be searched and go on the
searching.
Because construction of the sux tree is expensive, so we only do it when
necessary. We can do a lazy initialization as below.
TERM1 = $ # $: special terminator
class STreeUtil:
def __init__(self):
self.tree = None
def find_pattern(self, str, pattern):
if self.tree is None or self.tree.str!=str+TERM1:
self.tree = stree.suffix_tree(str+TERM1)
return lookup_pattern(self.tree, self.tree.root, pattern)
222CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
We always append special terminator to the string, so that there wont be
any sux becomes the prex of the other[2].
Some simple test cases are given to verify the program.
class StrSTreeTest:
def run(self):
self.test_find_pattern()
def test_find_pattern(self):
util=STreeUtil()
self.__test_pattern__(util, "banana", "ana")
self.__test_pattern__(util, "banana", "an")
self.__test_pattern__(util, "banana", "anan")
self.__test_pattern__(util, "banana", "nana")
self.__test_pattern__(util, "banana", "ananan")
def __test_pattern__(self, u, s, p):
print "findpattern", p, "in", s, ":", u.find_pattern(s, p)
And the output is like the following.
find pattern ana in banana : 2
find pattern an in banana : 2
find pattern anan in banana : 1
find pattern nana in banana : 1
find pattern ananan in banana : 0
Find the number of sub-string occurrence in C++
In C++, do-while is used as the repeat-until structure, the program is almost
as same as the standard Patricia looking up function.
int lookup_pattern(const STree t, std::string s){
Node node = troot;
bool match(false);
do{
match=false;
for(Node::Children::iterator it = nodechildren.begin();
it!=nodechildren.end(); ++it){
RefPair rp = itsecond;
if(rp.str().substr().find(s)==0){
int res = rp.node()children.size();
return res == 0? 1 : res;
}
else if(s.find(rp.str().substr())==0){
match = true;
node = rp.node();
s = s.substr(rp.str().substr().length());
break;
}
}
}while(match);
return 0;
}
6.5. SUFFIX TREE APPLICATIONS 223
An utility class is dened and it support lazy initialization to save the cost
of construction of sux tree.
class STreeUtil{
public:
STreeUtil():t(0){}
~STreeUtil(){ delete t; }
int find_pattern(std::string s, std::string pattern){
lazy(s);
return lookup_pattern(t, pattern);
}
private:
void lazy(std::string s){
if((!t) | | tstr != s+TERM1){
delete t;
t = suffix_tree(s+TERM1);
}
}
STree t;
};
The same test cases can be feed to this C++ program.
class StrSTreeTest{
public:
void test_find_pattern(){
__test_pattern("banana", "ana");
__test_pattern("banana", "an");
__test_pattern("banana", "anan");
__test_pattern("banana", "nana");
__test_pattern("banana", "ananan");
}
pivate:
void __test_pattern(std::string s, std::string ptn){
std::cout<<"findpattern"<<ptn<<"in"<<s<<":"
<<util.find_pattern(s, ptn)<<"n";
}
STreeUtil util;
};
And the same result will be obtained like the Python program.
Find the number of sub-string occurrence in Haskell
The Haskell program is just turn the looking up into recursive way.
lookupPattern :: Tr String Int
lookupPattern (Br lst) ptn = find lst where
find [] = 0
find ((s, t):xs)
| ptn isPrefixOf s = numberOfBranch t
| s isPrefixOf ptn = lookupPattern t (drop (length s) ptn)
| otherwise = find xs
numberOfBranch (Br ys) = length ys
224CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
numberOfBranch _ = 1
findPattern :: String String Int
findPattern s ptn = lookupPattern (suffixTree $ s++"$") ptn
To verify it, the test cases are fed to the program as the following
testPattern = ["find pattern "++p++" in banana: "++
(show $ findPattern "banana" p)
| p ["ana", "an", "anan", "nana", "anana"]]
Launching GHCi, evaluate the instruction can output the same result as the
above programs.
putStrLn $ unlines testPattern
Find the number of sub-string occurrence in Scheme/Lisp
Because the underground data structure of sux tree is list in Scheme/Lisp
program, we neednt dene a inner nd function as in Haskell program.
(define (lookup-pattern t ptn)
(define (number-of-branches node)
(if (null? node) 1 (length node)))
(if (null? t) 0
(let ((s (edge (car t)))
(tr (children (car t))))
(cond ((string-prefix? ptn s)(number-of-branches tr))
((string-prefix? s ptn)
(lookup-pattern tr (string-tail ptn (string-length s))))
(else lookup-pattern (cdr t) ptn)))))
The test cases are fed to this program via a list.
(define (test-pattern)
(define (test-ptn t s)
(cons (string-append "findpattern" s "inbanana" )
(lookup-pattern t s)))
(let ((t (suffix-tree "banana")))
(map (lambda (x) (test-ptn t x)) ("ana" "an" "anan" "nana" "anana"))))
Evaluate this test function can generate a result list as the following.
(test-pattern)
;Value 16: (("find pattern ana in banana" "ana") ("find pattern an in banana" "an") ("find pattern anan in banana" "anan") ("find pattern nana in banana" "nana") ("find pattern anana in banana" "anana"))
Complete pattern search
For search pattern like a**n with sux tree, please refer to [13] and [14].
6.5.2 Find the longest repeated sub-string
If we go one step ahead from 6.5.1, below result can be found.
After adding a special terminator character to string S, The longest repeated
sub-string can be found by searching the deepest branches in sux tree.
Consider the example sux tree shown in gure 6.11
6.5. SUFFIX TREE APPLICATIONS 225
$ i mississippi$ p s
$ ppi$
A
ssi
ppi$ ssippi$
i$ pi$
B
i
C
si
ppi$ ssippi$ ppi$ ssippi$
Figure 6.11: The sux tree for mississippi$
There are 3 branch nodes, A, B, and C which depth is 3. However, A
represents the longest repeated sub-string issi. B and C represent for si,
ssi, they are all shorter than A.
This example tells us that the depth of the branch node should be mea-
sured by the number of characters traversed from the root.
Find the longest repeated sub-string in imperative approach
According to the above analysis, to nd the longest repeated sub-string can be
turned into a BFS (Bread First Search) in a sux tree.
1: function LONGEST-REPEATED-SUBSTRING(T)
2: Q (NIL, ROOT(T))
3: R NIL
4: while Q is not empty do
5: (s, node) POP(Q)
6: for each ((l, r), node
) in CHILDREN(node) do
7: if node
, node
))
10: UPDATE(R, s
)
11: return R
where algorithm UPDATE() will compare the longest repeated sub-string
candidates. If two candidates have the same length, one simple solution is just
take one as the nal result, the other solution is to maintain a list contains all
candidates with same length.
1: function UPDATE(l, x)
2: if l = NIL or LENGTH(l[1]) < LENGTH(x) then
3: return l [x]
4: else if LENGTH(l[1]) = LENGTH(x) then
5: return APPEND(l, x)
226CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Note that the index of a list starts from 1 in this algorithm. This algorithm
will rst initialize a queue with a pair of an empty string and the root node.
Then it will repeatedly pop from the queue, examine the candidate node until
the queue is empty.
For each node, the algorithm will expand all children, if it is a branch node
(which is not a leaf), the node will be pushed back to the queue for future
examine. And the sub-string represented by this node will be compared to see
if it is a candidate of the longest repeated sub-string.
Find the longest repeated sub-string in Python
The above algorithm can be translated into Python program as the following.
def lrs(t):
queue = [("", t.root)]
res = []
while len(queue)>0:
(s, node) = queue.pop(0)
for _, (str_ref, tr) in node.children.items():
if len(tr.children)>0:
s1 = s+t.substr(str_ref)
queue.append((s1, tr))
res = update_max(res, s1)
return res
def update_max(lst, x):
if lst ==[] or len(lst[0]) < len(x):
return [x]
elif len(lst[0]) == len(x):
return lst + [x]
else:
return lst
In order to verify this program, some simple test cases are fed.
class StrSTreeTest:
#...
def run(self):
#...
self.test_lrs()
def test_lrs(self):
self.__test_lrs__("mississippi")
self.__test_lrs__("banana")
self.__test_lrs__("cacao")
self.__test_lrs__("foofooxbarbar")
def __test_lrs__(self, s):
print "longestrepeatedsubstringsof", s, "=", self.util.find_lrs(s)
By running the test case, the result like below can be obtained.
longest repeated substrings of mississippi = [issi]
longest repeated substrings of banana = [ana]
6.5. SUFFIX TREE APPLICATIONS 227
longest repeated substrings of cacao = [ca]
longest repeated substrings of foofooxbarbar = [bar, foo]
Find the longest repeated sub-string in C++
With C++, we can utilize the STL queue library in the implementation of
BFS(Bread First Search).
typedef std::list<std::string> Strings;
Strings lrs(const STree t){
std::queue<std::pair<std::string, Node> > q;
Strings res;
q.push(std::make_pair(std::string(""), troot));
while(!q.empty()){
std::string s;
Node node;
tie(s, node) = q.front();
q.pop();
for(Node::Children::iterator it = nodechildren.begin();
it!=nodechildren.end(); ++it){
RefPair rp = itsecond;
if(!(rp.node()children.empty())){
std::string s1 = s + rp.str().substr();
q.push(std::make_pair(s1, rp.node()));
update_max(res, s1);
}
}
}
return res;
}
Firstly, the empty string and root node is pushed to the queue as initialized
value. Then the program repeatedly pop from the queue, examine it to see if
any child of the node is not a leaf node, push it back to the queue and check if
it is the deepest one.
The function update max() is implemented to record all the longest strings.
void update_max(Strings& res, std::string s){
if(res.empty() | | (res.begin()).length() < s.length()){
res.clear();
res.push_back(s);
return;
}
if((res.begin()).length() == s.length())
res.push_back(s);
}
Since the cost of construction a sux tree is big (O(n) with Ukkonens
algorithm), some lazy initialization approach is used in the main entrance of
the nding program.
const char TERM1 = $;
class STreeUtil{
public:
228CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
STreeUtil():t(0){}
~STreeUtil(){ delete t; }
Strings find_lrs(std::string s){
lazy(s);
return lrs(t);
}
private:
void lazy(std::string s){
if((!t) | | tstr != s+TERM1){
delete t;
t = suffix_tree(s+TERM1);
}
}
STree t;
};
In order to verify the program, some test cases are provided. output for list
of strings can be easily realized by overloading operator.
class StrSTreeTest{
public:
StrSTreeTest(){
std::cout<<"startstringmanipulationoversuffixtreetestn";
}
void run(){
test_lrs();
}
void test_lrs(){
__test_lrs("mississippi");
__test_lrs("banana");
__test_lrs("cacao");
__test_lrs("foofooxbarbar");
}
private:
void __test_lrs(std::string s){
std::cout<<"longestrepeatedsubstirngof"<<s<<"="
<<util.find_lrs(s)<<"n";
}
STreeUtil util;
};
Running these test cases, we can obtain the following result.
start string manipulation over suffix tree test
longest repeated substirng of mississippi=[issi, ]
longest repeated substirng of banana=[ana, ]
longest repeated substirng of cacao=[ca, ]
longest repeated substirng of foofooxbarbar=[bar, foo, ]
6.5. SUFFIX TREE APPLICATIONS 229
Find the longest repeated sub-string in functional approach
Searching the deepest branch can also be realized in functional way. If the tree
is just a leaf node, empty string is returned, else the algorithm will try to nd
the longest repeated sub-string from the children of the tree.
1: function LONGEST-REPEATED-SUBSTRING(T)
2: if T is leaf then
3: return Empty
4: else
5: return PROC(CHILDREN(T))
1: function PROC(L)
2: if L is empty then
3: return Empty
4: else
5: (s, node) FIRST(L)
6: x s +LONGEST REPEATED SUBSTRING
(T)
7: y PROC(REST(L))
8: if LENGTH(x) > LENGTH(y) then
9: return x
10: else
11: return y
In PROC function, the rst element, which is a pair of edge string and a
child node, will be examine rstly. We recursively call the algorithm to nd
the longest repeated sub-string from the child node, and append it to the edge
string. Then we compare this candidate sub string with the result obtained
from the rest of the children. The longer one will be returned as the nal result.
Note that in case x and y have the same length, it is easy to modify the
program to return both of them.
Find the longest repeated sub-string in Haskell
Well provide 2 versions of Haskell implementation. One version just returns the
rst candidate in case there are multiple sub-strings which have the same length
as the longest sub-string. The other version returns all possible candidates.
isLeaf::Tr Bool
isLeaf Lf = True
isLeaf _ = False
lrs::TrString
lrs Lf = ""
lrs (Br lst) = find $ filter (not isLeaf snd) lst where
find [] = ""
find ((s, t):xs) = maximumBy (compare on length) [s++(lrs t), find xs]
In this version, we used the maximumBy function provided in Data.List
module. it will only return the rst maximum value in a list. In order to return
all maximum candidates, we need provide a customized function.
maxBy::(Ord a)(aaOrdering)[a][a]
maxBy _ [] = []
maxBy cmp (x:xs) = foldl maxBy [x] xs where
230CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
maxBy lst y = case cmp (head lst) y of
GT lst
EQ lst ++ [y]
LT [y]
lrs::Tr[String]
lrs Lf = [""]
lrs (Br lst) = find $ filter (not isLeaf snd) lst where
find [] = [""]
find ((s, t):xs) = maxBy (compare on length)
((map (s++) (lrs t)) ++ (find xs))
We can feed some simple test cases and compare the results of these 2 dif-
ferent program to see their dierence.
testLRS s = "LRS(" ++ s ++ ")=" ++ (show $ lrs $ suffixTree (s++"$")) ++ "n"
testLRS s = "LRS(" ++ s ++ ")=" ++ (lrs $ suffixTree (s++"$")) ++ "n"
test = concat [ f s | s["mississippi", "banana", "cacao", "foofooxbarbar"],
f[testLRS, testLRS]]
Below are the results printed out.
LRS(mississippi)=["issi"]
LRS(mississippi)=issi
LRS(banana)=["ana"]
LRS(banana)=ana
LRS(cacao)=["ca"]
LRS(cacao)=ca
LRS(foofooxbarbar)=["bar","foo"]
LRS(foofooxbarbar)=foo
Find the longest repeated sub-string in Scheme/Lisp
Because the underground data structure is list in Scheme/Lisp, in order to access
the sux tree components easily, some helper functions are provided.
(define (edge t)
(car t))
(define (children t)
(cdr t))
(define (leaf? t)
(null? (children t)))
Similar with the Haskell program, a function which can nd all the maximum
values on a special measurement rules are given.
(define (compare-on func)
(lambda (x y)
(cond ((< (func x) (func y)) lt)
((> (func x) (func y)) gt)
(else eq))))
6.5. SUFFIX TREE APPLICATIONS 231
(define (max-by comp lst)
(define (update-max xs x)
(case (comp (car xs) x)
(lt (list x))
(gt xs)
(else (cons x xs))))
(if (null? lst)
()
(fold-left update-max (list (car lst)) (cdr lst))))
Then the main function for searching the longest repeated sub-strings can
be implemented as the following.
(define (lrs t)
(define (find lst)
(if (null? lst)
("")
(let ((s (edge (car lst)))
(tr (children (car lst))))
(max-by (compare-on string-length)
(append
(map (lambda (x) (string-append s x)) (lrs tr))
(find (cdr lst)))))))
(if (leaf? t)
("")
(find (filter (lambda (x) (not (leaf? x))) t))))
(define (longest-repeated-substring s)
(lrs (suffix-tree (string-append s TERM1))))
Where TERM1 is dened as $ string.
Same test cases can be used to verify the results.
(define (test-main)
(let ((fs (list longest-repeated-substring))
(ss ("mississippi" "banana" "cacao" "foofooxbarbar")))
(map (lambda (f) (map f ss)) fs)))
This test program can be easily extended by adding new test functions as a
element of fs list. the result of the above function is as below.
(test-main)
;Value 16: ((("issi") ("ana") ("ca") ("bar" "foo")))
6.5.3 Find the longest common sub-string
The longest common sub-string of two strings, can also be quickly found by
using sux tree. A typical solution is to build a generalized sux tree for two
strings. If the two strings are denoted as txt
1
and txt
2
, a generalized sux tree
is SuffixTree(txt
1
$
1
txt
2
$
2
). Where $
1
is a special terminator character for
txt
1
, and $
2
is another special terminator character for txt
2
.
The longest common sub-string is indicated by the deepest branch node, with
two forks corresponding to both ...$
1
... and ...$
2
(no $
1
). The denition of
232CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
the deepest node is as same as the one for the longest repeated sub-string, it is
the number of characters traversed from root.
If a node has ...$
1
... beneath it, then the node must represent to a sub-
string of txt
1
, as $
1
is the terminator of txt
1
. On the other hand, since it also
has ...$
2
(without $
1
) child, this node must represent to a sub-string of txt
2
too. Because of its a deepest one satised this criteria. The node indicates to
the longest common sub-string.
Find the longest common sub-string imperatively
Based on the above analysis, a BFS (bread rst search) algorithm can be used
to nd the longest common sub-string.
1: function LONGEST-COMMON-SUBSTRING(T)
2: Q (NIL, ROOT(T))
3: R NIL
4: while Q is not empty do
5: (s, node) POP(Q)
6: if MATCH FORK(node) then
7: UPDATE(R, s)
8: for each ((l, r), node
) in CHILDREN(node) do
9: if node
, node
))
12: return R
The most part is as same as the algorithm for nding the longest repeated
sub-sting. The function MATCH FORK() will check if the children of a
node satisfy the common sub-string criteria.
Find the longest common sub-string in Python
By translate the imperative algorithm in Python, the following program can be
obtained.
def lcs(t):
queue = [("", t.root)]
res = []
while len(queue)>0:
(s, node) = queue.pop(0)
if match_fork(t, node):
res = update_max(res, s)
for _, (str_ref, tr) in node.children.items():
if len(tr.children)>0:
s1 = s + t.substr(str_ref)
queue.append((s1, tr))
return res
Where we dene the function match fork() as below.
def is_leaf(node):
return node.children=={}
def match_fork(t, node):
6.5. SUFFIX TREE APPLICATIONS 233
if len(node.children)==2:
[(_, (str_ref1, tr1)), (_, (str_ref2, tr2))]=node.children.items()
return is_leaf(tr1) and is_leaf(tr2) and
(t.substr(str_ref1).find(TERM2)!=-1) !=
(t.substr(str_ref2).find(TERM2)!=-1)
return False
In this function, it checks if the two children of a node are both leaf, and
one contains TERM2 character, while the other doesnt. This is because if one
child node is a leaf, it will always contains TERM1 character according to the
denition of sux tree.
Note, the main interface of the function is to add TERM2 to the rst string
and append TERM1 to the second string.
class STreeUtil:
def __init__(self):
self.tree = None
def __lazy__(self, str):
if self.tree is None or self.tree.str!=str+TERM1:
self.tree = stree.suffix_tree(str+TERM1)
def find_lcs(self, s1, s2):
self.__lazy__(s1+TERM2+s2)
return lcs(self.tree)
We can test this program like below:
util = STreeUtil()
print "longestcommonsubstringofababaandbaby=", util.find_lcs("ababa", "baby")
And the output will be something like:
longest common substring of ababx and baby = [bab]
Find the longest common sub-string in C++
In C++ implementation, we rst dene the special terminator characters for
the generalized sux tree of two strings.
const char TERM1 = $;
const char TERM2 = #;
Since the program need frequently test if a node is a branch node or leaf
node, a helper function is provided.
bool is_leaf(Node node){
return nodechildren.empty();
}
The criteria for a candidate node is that it has two children. One in pattern
...#..., the other in pattern ...$.
bool match_fork(Node node){
if(nodechildren.size() == 2){
RefPair rp1, rp2;
Node::Children::iterator it = nodechildren.begin();
rp1 = (it++)second;
234CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
rp2 = itsecond;
return (is_leaf(rp1.node()) && is_leaf(rp2.node())) &&
(rp1.str().substr().find(TERM2)!=std::string::npos)!=
(rp2.str().substr().find(TERM2)!=std::string::npos);
}
return false;
}
The main program in BFS(Bread First Search) approach is given as below.
Strings lcs(const STree t){
std::queue<std::pair<std::string, Node> > q;
Strings res;
q.push(std::make_pair(std::string(""), troot));
while(!q.empty()){
std::string s;
Node node;
tie(s, node) = q.front();
q.pop();
if(match_fork(node))
update_max(res, s);
for(Node::Children::iterator it = nodechildren.begin();
it!=nodechildren.end(); ++it){
RefPair rp = itsecond;
if(!is_leaf(rp.node())){
std::string s1 = s + rp.str().substr();
q.push(std::make_pair(s1, rp.node()));
}
}
}
return res;
}
After that we can nalize the interface in a lazy way as the following.
class STreeUtil{
public:
//...
Strings find_lcs(std::string s1, std::string s2){
lazy(s1+TERM2+s2);
return lcs(t);
}
//...
This C++ program can generate similar result as the Python one if same
test cases are given.
longest common substring of ababa, baby =[bab, ]
Find the longest common sub-string recursively
The longest common sub-string nding algorithm can also be realized in func-
tional way.
1: function LONGEST-COMMON-SUBSTRING(T)
2: if T is leaf then
3: return Empty
6.5. SUFFIX TREE APPLICATIONS 235
4: else
5: return PROC(CHILDREN(T))
If the generalized sux tree is just a leaf, empty string is returned to indicate
the trivial result. In other case, we need process the children of the tree.
1: function PROC(L)
2: if L is empty then
3: return Empty
4: else
5: (s, node) FIRST(L)
6: if MATCH FORK(node) then
7: x s
8: else
9: x LONGEST COMMON SUBSTRING
(node)
10: if x = Empty then
11: x s +x
12: y PROC(LEFT(L))
13: if LENGTH(x) > LENGTH(y) then
14: return x
15: else
16: return y
If the children list is empty, the algorithm returns empty string. In other
case, the rst element, as a pair of edge string and a child node, is rst picked,
if this child node match the fork criteria (one is in pattern ...$
1
..., the other in
pattern ...$
2
without $
1
), then the edge string is a candidate. The algorithm
will process the rest children list and compare with this candidate. The longer
one will be returned as the nal result. If it doesnt match the fork criteria, we
need go on nd the longest common sub-string from this child node recursively.
and do the similar comparison afterward.
Find the longest common sub-string in Haskell
Similar as the longest repeated sub-string problem, there are two alternative,
one is to just return the rst longest common sub-string. The other is to return
all the candidates.
lcs::Tr[String]
lcs Lf = []
lcs (Br lst) = find $ filter (not isLeaf snd) lst where
find [] = []
find ((s, t):xs) = maxBy (compare on length)
(if match t
then s:(find xs)
else (map (s++) (lcs t)) ++ (find xs))
Most of the program is as same as the one for nding the longest repeated
sub-string. The match function is dened to check the fork criteria.
match (Br [(s1, Lf), (s2, Lf)]) = ("#" isInfixOf s1) /= ("#" isInfixOf s2)
match _ = False
If the function maximumBy dened in Data.List is used, only the rst
candidate will be found.
236CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
lcs::TrString
lcs Lf = ""
lcs (Br lst) = find $ filter (not isLeaf snd) lst where
find [] = ""
find ((s, t):xs) = maximumBy (compare on length)
(if match t then [s, find xs]
else [tryAdd s (lcs t), find xs])
tryAdd x y = if y=="" then "" else x++y
We can test this program by some simple cases, below are the snippet of the
result in GHCi.
lcs $ suffixTree "baby#ababa$"
["bab"]
Find the longest common sub-string in Scheme/Lisp
It can be found from the Haskell program, that the common structure of the
lrs and lcs are very similar to each other, this hint us that we can abstract to a
common search function.
(define (search-stree t match)
(define (find lst)
(if (null? lst)
()
(let ((s (edge (car lst)))
(tr (children (car lst))))
(max-by (compare-on string-length)
(if (match tr)
(cons s (find (cdr lst)))
(append
(map (lambda (x) (string-append s x)) (search-stree tr match))
(find (cdr lst))))))))
(if (leaf? t)
()
(find (filter (lambda (x) (not (leaf? x))) t))))
This function takes a sux tree and a function to test if a node match a
certain criteria. It will lter out all leaf node rst, then repeatedly check if each
branch node match the criteria. If matches the function will compare the edge
string to see if it is the longest one, else, it will recursively check the child node
until either fails or matches.
The longest common sub-string function can be then implemented with this
function.
(define (xor x y)
(not (eq? x y)))
(define (longest-common-substring s1 s2)
(define (match-fork t)
(and (eq? 2 (length t))
(and (leaf? (car t)) (leaf? (cadr t)))
(xor (substring? TERM2 (edge (car t)))
(substring? TERM2 (edge (cadr t))))))
(search-stree (suffix-tree (string-append s1 TERM2 s2 TERM1)) match-fork))
6.5. SUFFIX TREE APPLICATIONS 237
We can test this function with some simple cases:
(longest-common-substring "xbaby" "ababa")
;Value 11: ("bab")
(longest-common-substring "ff" "bb")
;Value: ()
6.5.4 Find the longest palindrome in a string
A palindrome is a string, S, such that S = reverse(S), for instance, in English,
level, rotator, civic are all palindrome.
The longest palindrome in a string s
1
s
2
...s
n
can be found in O(n) time with
sux tree. The solution can be benet from the longest common sub-string
problem.
For string S, if a sub-string w is a palindrome, then it must be sub-string
of reverse(S) too. for instance, issi is a palindrome, it is a sub-string of
mississippi. When turns it reversed to ippississim, we found that issi is
also a sub-string.
Based on this truth, we can nd get the longest palindrome by nding the
longest common sub-string for S and reverse(S).
The algorithm is straightforward for both imperative and functional ap-
proach.
function LONGEST-PALINDROME(S)
return LONGESTCOMMONSUBSTRING(SUFFIXTREE(S+
REV ERSE(S)))
Find the longest palindrome in Python
In Python we can reverse a string s like: s[::-1], which means we start from the
beginning to ending with step -1.
class STreeUtil:
#...
def find_lpalindrome(self, s):
return self.find_lcs(s, s[::-1]) #l[::-1] = reverse(l)
We can feed some simple test cases to check if the program can nd the
palindrome.
class StrSTreeTest:
def test_lpalindrome(self):
self.__test_lpalindrome__("mississippi")
self.__test_lpalindrome__("banana")
self.__test_lpalindrome__("cacao")
self.__test_lpalindrome__("Woolloomooloo")
def __test_lpalindrome__(self, s):
print "longestpalindromeof", s, "=", self.util.find_lpalindrome(s)
The result is something like the following.
longest palindrome of mississippi = [ississi]
longest palindrome of banana = [anana]
238CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
longest palindrome of cacao = [aca, cac]
longest palindrome of Woolloomooloo = [loomool]
Find the longest palindrome in C++
C++ program is just delegate the call to longest common sub-string function.
Strings find_lpalindrome(std::string s){
std::string s1(s);
std::reverse(s1.begin(), s1.end());
return find_lcs(s, s1);
}
The test cases are added as the following.
class StrSTreeTest{
public:
//...
void test_lrs(){
__test_lrs("mississippi");
__test_lrs("banana");
__test_lrs("cacao");
__test_lrs("foofooxbarbar");
}
private:
//...
void __test_lpalindrome(std::string s){
std::cout<<"longestpalindromeof"<<s<<"="
<<util.find_lpalindrome(s)<<"n";
}
Running the test cases generate the same result.
longest palindrome of mississippi =[ississi, ]
longest palindrome of banana =[anana, ]
longest palindrome of cacao =[aca, cac, ]
longest palindrome of Woolloomooloo =[loomool, ]
Find the longest palindrome in Haskell
Haskell program of nding the longest palindrome is implemented as below.
longestPalindromes s = lcs $ suffixTree (s++"#"++(reverse s)++"$")
If some strings are fed to the program, results like the following can be
obtained.
longest palindrome(mississippi)=["ississi"]
longest palindrome(banana)=["anana"]
longest palindrome(cacao)=["aca","cac"]
longest palindrome(foofooxbarbar)=["oofoo"]
Find the longest palindrome in Scheme/Lisp
Scheme/Lisp program of nding the longest palindrome is realized as the fol-
lowing.
6.6. NOTES AND SHORT SUMMARY 239
(define (longest-palindrome s)
(longest-common-substring (string-append s TERM2)
(string-append (reverse-string s) TERM1)))
We can just add this function to the fs list in test-main program, so that the
test will automatically done.
(define (test-main)
(let ((fs (list longest-repeated-substring longest-palindrome))
(ss ("mississippi" "banana" "cacao" "foofooxbarbar")))
(map (lambda (f) (map f ss)) fs)))
The relative result snippet is as below.
(test-main)
;Value 12: (... (("ississi") ("anana") ("aca" "cac") ("oofoo")))
6.5.5 Others
Sux tree can also be used in data compression, Burrows-Wheeler transform,
LZW compression (LZSS) etc. [2]
6.6 Notes and short summary
Sux Tree was rst introduced by Weiner in 1973 [?]. In 1976, McCreight
greatly simplied the construction algorithm. McCreight construct the sux
tree from right to left. And in 1995, Ukkonen gave the rst on-line construction
algorithms from left to right. All the three algorithms are linear time (O(n)).
And some research shows the relationship among these 3 algorithms. [7]
6.7 Appendix
All programs provided along with this article are free for downloading.
6.7.1 Prerequisite software
GNU Make is used for easy build some of the program. For C++ and ANSI
C programs, GNU GCC and G++ 3.4.4 are used. For Haskell programs GHC
6.10.4 is used for building. For Python programs, Python 2.5 is used for testing,
for Scheme/Lisp program, MIT Scheme 14.9 is used.
all source les are put in one folder. Invoke make or make all will build
C++ and Haskell program.
Run make Haskell will separate build Haskell program. the executable le
is happ (with .exe in Window like OS). It is also possible to run the program
in GHCi.
6.7.2 Tools
Besides them, I use graphviz to draw most of the gures in this post. In order
to translate the Trie, Patrica and Sux Tree output to dot language scripts. I
wrote a python program. it can be used like this.
240CHAPTER 6. SUFFIX TREE WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
st2dot -o filename.dot -t type "string"
Where lename.dot is the output le for the dot script, type can be either
trie or tree, the default value is tree. it can generate sux Trie/tree from the
string input and turns the tree/Trie into dot script.
This helper scripts can also be downloaded with this article.
download position: http://sites.google.com/site/algoxy/stree/stree.zip
Bibliography
[1] Esko Ukkonen. On-line construction of sux trees. Al-
gorithmica 14 (3): 249260. doi:10.1007/BF01206331.
http://www.cs.helsinki./u/ukkonen/SuxT1withFigs.pdf
[2] Sux Tree, Wikipedia. http://en.wikipedia.org/wiki/Sux tree
[3] Esko Ukkonen. Sux tree and sux array techniques for pattern analysis
in strings. http://www.cs.helsinki./u/ukkonen/Erice2005.ppt
[4] Trie, Wikipedia. http://en.wikipedia.org/wiki/Trie
[5] Liu Xinyu. Trie and Patricia, with Functional and imperative implemen-
tation. http://sites.google.com/site/algoxy/trie
[6] Sux Tree (Java). http://en.literateprograms.org/Sux tree (Java)
[7] Robert Giegerich and Stefan Kurtz. From Ukkonen to McCreight
and Weiner: A Unifying View of Linear-Time Sux Tree Con-
struction. Science of Computer Programming 25(2-3):187-218, 1995.
http://citeseer.ist.psu.edu/giegerich95comparison.html
[8] Robert Giegerich and Stefan Kurtz. A Comparison of Imper-
ative and Purely Functional Sux Tree Constructions. Algo-
rithmica 19 (3): 331353. doi:10.1007/PL00009177. www.zbh.uni-
hamburg.de/pubs/pdf/GieKur1997.pdf
[9] Bryan OSullivan. suxtree: Ecient, lazy sux tree implementation.
http://hackage.haskell.org/package/suxtree
[10] Danny. http://hkn.eecs.berkeley.edu/ dyoo/plt/suxtree/
[11] Zhang Shaojie. Lecture of Sux Trees.
http://www.cs.ucf.edu/ shzhang/Combio09/lec3.pdf
[12] Lloyd Allison. Sux Trees. http://www.allisons.org/ll/AlgDS/Tree/Sux/
[13] Esko Ukkonen. Sux tree and sux array techniques for pattern analysis
in strings. http://www.cs.helsinki./u/ukkonen/Erice2005.ppt
[14] Esko Ukkonen Approximate string-matching over sux trees. Proc. CPM
93. Lecture Notes in Computer Science 684, pp. 228-242, Springer 1993.
http://www.cs.helsinki./u/ukkonen/cpm931.ps
241
242 B-Trees
B-Trees with Functional and imperative implementation
Larry LIU Xinyu Larry LIU Xinyu
Email: [email protected]
Chapter 7
B-Trees with Functional
and imperative
implementation
7.1 abstract
B-Tree is introduced by Introduction to Algorithms book[2] as one of the
advanced data structures. It is important to the modern le systems, some of
them are implemented based on B+ tree, which is extended from B-tree. It is
also widely used in many database systems. This post provides some implemen-
tation of B-trees both in imperative way as described in [2] and in functional
way with a kind of modify-and-x approach. There are multiple programming
languages used, including C++, Haskell, Python and Scheme/Lisp.
There may be mistakes in the post, please feel free to point out.
This post is generated by L
A
T
E
X2
CHILDREN(node)[t...2t]
7: return (CREATE B TREE(ks, cs), KEY S(node)[t], CREATE
B TREE(ks
, cs
))
Split implemented in Haskell
Haskell prelude provide take/drop functions to get the part of the list. These
functions just returns empty list if the list passed in is empty. So there is no
need to test if the node is leaf.
split :: BTree a (BTree a, a, BTree a)
split (Node ks cs t) = (c1, k, c2) where
c1 = Node (take (t-1) ks) (take t cs) t
c2 = Node (drop t ks) (drop t cs) t
k = head (drop (t-1) ks)
Split implemented in Scheme/Lisp
As mentioned previously, the minimum degree t is passed as an argument. The
splitting is performed according to t.
(define (split tr t)
(if (leaf? tr)
(list (list-head tr (- t 1))
(list-ref tr (- t 1))
(list-tail tr t))
(list (list-head tr (- ( t 2) 1))
(list-ref tr (- ( t 2) 1))
(list-tail tr ( t 2)))))
When splitting a leaf node, because there is no child at all, the program
simply take the rst t 1 keys and the last t 1 keys to form two child, and
left the t-th key as the only key of the new node. It will return these 3 parts in
a list. When splitting a branch node, children must be also taken into account,
thats why the rst 2t 1 and the last 2t 1 elements are taken.
7.4.2 Split before insert method
Note that the split solution will push a key up to its parent node, It is possible
that the parent node be full if it has already 2t 1 keys.
Regarding to this issue, the [2] provides a solution to check every node from
root along the path until leaf, in case there is a node in this path is full. the
split is applied. Since the parent of this node has been examined except the
root node, which ensure the parent node has less than 2t 1 keys, the pushing
up of one key wont make the parent full. This approach need only a single pass
down the tree without need of any back-tracking.
The main insert algorithm will rst check if the root node need split. If
yes, it will create a new node, and set the root as the only child, then performs
splitting. and set the new node as the root. After that, the algorithm will try
to insert the key to the non-full node.
7.4. INSERTION 251
1: function B-TREE-INSERT(T, k)
2: r T
3: if r is full then
4: s CREATE NODE()
5: APPEND(CHILDREN(s), r)
6: B TREE SPLIT CHILD(s, 1)
7: r s
8: B TREE INSERT NONFULL(r, k)
9: return r
The algorithm B TREE INSERT NONFUL assert that the node
passed in is not full. If it is a leaf node, the new key is just inserted to the
proper position based on its order. If it is a branch node. The algorithm nds
a proper child node to which the new key will be inserted. If this child node is
full, the splitting will be performed rstly.
1: procedure B-TREE-INSERT-NONFUL(T, k)
2: if T is leaf then
3: i 1
4: while i LENGTH(KEY S(T)) and k > KEY S(T)[i] do
5: i i + 1
6: INSERT(KEY S(T), i, k)
7: else
8: i LENGTH(KEY S(T))
9: while i > 1andk < KEY S(T)[i] do
10: i i 1
11: if CHILDREN(T)[i] is full then
12: B TREE SPLIT CHILD(T, i)
13: if k > KEY S(T)[i] then
14: i i + 1
15: B TREE INSERT NONFULL(CHILDREN(T)[i], k)
Note that this algorithm is actually recursive. Consider B-tree typically has
minimum degree t relative to magnetic disk structure, and it is balanced tree,
Even small depth can support huge amount of data (with t = 10, maximum to
10 billion data can be stored in a B-tree with height of 10). Of course it is easy
to eliminate the recursive call to improve the algorithm.
In the below language specic implementations, Ill eliminate recursion in
C++ program, and show the recursive version in Python program.
Insert implemented in C++
The main insert program in C++ examine if the root is full and performs
splitting accordingly. Then it will call insert nonfull to do the further process.
template<class K, int t>
BTree<K, t> insert(BTree<K, t> tr, K key){
BTree<K, t> root(tr);
if(rootfull()){
BTree<K, t> s = new BTree<K, t>();
schildren.push_back(root);
ssplit_child(0);
root = s;
252CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
}
return insert_nonfull(root, key);
}
The recursion is eliminated in insert nonfull function. If the current node
is leaf, it will call ordered insert to insert the key to the correct position. If
it is branch node, the program will nd the proper child tree and set it as the
current node in next loop. Splitting is performed if the child tree is full.
template<class K, int t>
BTree<K, t> insert_nonfull(BTree<K, t> tr, K key){
typedef typename BTree<K, t>::Keys Keys;
typedef typename BTree<K, t>::Children Children;
BTree<K, t> root(tr);
while(!trleaf()){
unsigned int i=0;
while(i < trkeys.size() && trkeys[i] < key)
++i;
if(trchildren[i]full()){
trsplit_child(i);
if(key > trkeys[i])
++i;
}
tr = trchildren[i];
}
ordered_insert(trkeys, key);
return root;
}
Where the ordered insert is dened as the following.
template<class Coll>
void ordered_insert(Coll& coll, typename Coll::value_type x){
typename Coll::iterator it = coll.begin();
while(it != coll.end() && it < x)
++it;
coll.insert(it, x);
}
For convenience, I dened auxiliary functions to convert a list of keys into
the B-tree.
template<class T>
T insert_key(T t, typename T::key_type x){
return insert(t, x);
}
template<class Iterator, class T>
T list_to_btree(Iterator first, Iterator last, T t){
return std::accumulate(first, last, t,
std::ptr_fun(insert_key<T>));
}
In order to print the result as human readable string, a recursive convert
function is provided.
7.4. INSERTION 253
template<class T>
std::string btree_to_str(T tr){
typename T::Keys::iterator k;
typename T::Children::iterator c;
std::ostringstream s;
s<<"(";
if(trleaf()){
k=trkeys.begin();
s<<k++;
for(; k!=trkeys.end(); ++k)
s<<","<<k;
}
else{
for(k=trkeys.begin(), c=trchildren.begin();
k!=trkeys.end(); ++k, ++c)
s<<btree_to_str(c)<<","<<k<<",";
s<<btree_to_str(c);
}
s<<")";
return s.str();
}
With all the above dened program, some simple test cases can be fed to
verify the program.
const char ss[] = {"G", "M", "P", "X", "A", "C", "D", "E", "J", "K",
"N", "O", "R", "S", "T", "U", "V", "Y", "Z"};
BTree<std::string, 2> tr234=list_to_btree(ss, ss+sizeof(ss)/sizeof(char),
new BTree<std::string, 2>);
std::cout<<"2-3-4treeof";
std::copy(ss, ss+sizeof(ss)/sizeof(char),
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"n"<<btree_to_str(tr234)<<"n";
delete tr234;
BTree<std::string, 3> tr = list_to_btree(ss, ss+sizeof(ss)/sizeof(char),
new BTree<std::string, 3>);
std::cout<<"B-treewitht=3of";
std::copy(ss, ss+sizeof(ss)/sizeof(char),
std::ostream_iterator<std::string>(std::cout, ","));
std::cout<<"n"<<btree_to_str(tr)<<"n";
delete tr;
Run these lines will generate the following result:
2-3-4 tree of G, M, P, X, A, C, D, E, J, K, N, O, R, S, T, U, V, Y, Z,
(((A), C, (D)), E, ((G, J, K), M, (N, O)), P, ((R), S, (T), U, (V), X, (Y, Z)))
B-tree with t=3 of G, M, P, X, A, C, D, E, J, K, N, O, R, S, T, U, V, Y, Z,
((A, C), D, (E, G, J, K), M, (N, O), P, (R, S), T, (U, V, X, Y, Z))
Figure 7.4 shows the result.
254CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
E P
C M S U X
A D G J K N O R T V Y Z
a. Insert result of a 2-3-4 tree,
D M P T
A C E G J K N O R S U V X Y Z
b. Insert result of a B-tree with minimum degree of 3.
Figure 7.4: insert result
Insert implemented in Python
Implement the above insertion algorithm in Python is straightforward, we change
the index starts from 0 instead of 1.
def B_tree_insert(tr, key): # + data parameter
root = tr
if root.is_full():
s = BTreeNode(root.t, False)
s.children.insert(0, root)
s.split_child(0)
root = s
B_tree_insert_nonfull(root, key)
return root
And the insertion to non-full node is implemented as the following.
def B_tree_insert_nonfull(tr, key):
if tr.leaf:
ordered_insert(tr.keys, key)
#disk_write(tr)
else:
i = len(tr.keys)
while i>0 and key < tr.keys[i-1]:
i = i-1
#disk_read(tr.children[i])
if tr.children[i].is_full():
tr.split_child(i)
if key>tr.keys[i]:
i = i+1
B_tree_insert_nonfull(tr.children[i], key)
Where the function ordered insert function is used to insert an element
to an ordered list. Since Python standard list dont support order information.
The program is written as below.
7.4. INSERTION 255
def ordered_insert(lst, x):
i = len(lst)
lst.append(x)
while i>0 and lst[i]<lst[i-1]:
(lst[i-1], lst[i]) = (lst[i], lst[i-1])
i=i-1
For the array based collection, append on the tail is much more eective
than insert in other position, because the later takes O(n) time, if the length of
the collection is n. This program will rst append the new element at the end
of the existing collection, then iterate from the last element to the rst one, and
check if the current two elements next to each other are ordered. If not, these
two elements will be swapped.
For easily creating a B-tree from a list of keys, we can write a simple helper
function.
def list_to_B_tree(l, t=TREE_2_3_4):
tr = BTreeNode(t)
for x in l:
tr = B_tree_insert(tr, x)
return tr
By default, this function will create a 2-3-4 tree, and user can specify the
minimum degree as the second parameter. The rst parameter is a list of keys.
This function will repeatedly insert every key into the B-tree which starts from
an empty tree.
In order to print the B-tree out for verication, an auxiliary printing function
is provided.
def B_tree_to_str(tr):
res = "("
if tr.leaf:
res += ",".join(tr.keys)
else:
for i in range(len(tr.keys)):
res+= B_tree_to_str(tr.children[i]) + "," + tr.keys[i] + ","
res += B_tree_to_str(tr.children[len(tr.keys)])
res += ")"
return res
After that, some smoke test cases can be use to verify the insertion program.
class BTreeTest:
def run(self):
self.test_insert()
def test_insert(self):
lst = ["G", "M", "P", "X", "A", "C", "D", "E", "J", "K",
"N", "O", "R", "S", "T", "U", "V", "Y", "Z"]
tr = list_to_B_tree(lst)
print B_tree_to_str(tr)
print B_tree_to_str(list_to_B_tree(lst, 3))
Run the test cases prints two dierent B-trees. They are identical to the
C++ program outputs.
256CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
(((A), C, (D)), E, ((G, J, K), M, (N, O)), P, ((R), S, (T), U, (V), X, (Y, Z)))
((A, C), D, (E, G, J, K), M, (N, O), P, (R, S), T, (U, V, X, Y, Z))
7.4.3 Insert then x method
Another approach to implement B-tree insertion algorithm is just nd the po-
sition for the new key and insert it. Since such insertion may violate B-tree
properties. We can then apply a xing procedure after that. If a leaf contains
too much keys, we split it into 2 leafs and push a key up to the parent branch
node. Of course this operation may cause the parent node violate the B-tree
properties, so the algorithm need traverse from leaf to root to perform the xing.
By using recursive implementation these xing method can also be realized
from top to bottom.
1: function B-TREE-INSERT(T, k)
2: return FIX ROOT(RECURSIV E INSERT(T, k))
Where FIXROOT examine if the root node contains too many keys, and
do splitting if necessary.
1: function FIX-ROOT(T)
2: if FULL?(T) then
3: T B TREE SPLIT(T)
4: return T
And the inner function INSERT(T, k) will rst check if T is leaf node or
branch node. It will do directly insertion for leaf and recursively do insertion
for branch.
1: function RECURSIVE-INSERT(T, k)
2: if LEAF?(T) then
3: INSERT(KEY S(T), k)
4: return T
5: else
6: initialize empty arrays of k
, k
, c
, c
7: i 1
8: while i <= LENGTH(KEY S(T)) and KEY S(T)[i] < k do
9: APPEND(k
, KEY S(T)[i])
10: APPEND(c
, CHILDREN(T)[i])
11: i i + 1
12: k
CHILDREN(T)[i + 1...LENGTH(CHILDREN(T))]]
14: c CHILDREN(T)[i]
15: left (k
, c
)
16: right (k
, c
)
17: return MAKEBTREE(left, RECURSIV EINSERT(c, k), right)
Figure 7.5 shows the branch case. The algorithm rst locates the position.
for certain key k
i
, if the new key k to be inserted satisfy k
i1
< k < k
i
, Then
we need recursively insert k to child c
i
.
This position divides the node into 3 parts, the left part, the child c
i
and
the right part.
The procedure MAKE B TREE take 3 parameters, which relative to
the left part, the result after insert k to c
i
and right part. It tries to merge these
7.4. INSERTION 257
k, K[i-1]<k<K[i]
K[1] K[2] ... K[i-1] K[i] ... K[n]
insert to
C[1] C[2] ... C[i-1] C[i] C[i+1] ... C[n] C[n+1]
a. locate the child to insert,
K[1] K[2] ... K[i-1]
C[1] C[2] ... C[i-1]
k, K[i-1]<k<K[i]
C[i]
recursive insert
K[i] K[i+1] ... K[n]
C[i+1] ... C[n+1]
b. recursive insert,
Figure 7.5: Insert a key to a branch node
3 parts into a new B-tree branch node.
However, insert key into a child may make this child violate the B-tree
property if it exceed the limitation of the number of keys a node can have.
MAKE B TREE will detect such situation and try to x the problem by
splitting.
1: function MAKE-B-TREE(L, C, R)
2: if FULL?(C) then
3: return FIX FULL(L, C, R)
4: else
5: T CREATE NEW NODE()
6: KEY S(T) KEY S(L) +KEY S(R)
7: CHILDREN(T) CHILDREN(L) + [C] +CHILDREN(R)
8: return T
Where FIX FULL just calls splitting process.
1: function FIX-FULL(L, C, R)
2: (C
, K, C
) B TREE SPLIT(C)
3: T CREATE NEW NODE()
4: KEY S(T) KEY S(L) + [K] +KEY S(R)
5: CHILDREN(T) CHILDREN(L) + [C
, C
] +CHILDREN(R)
6: return T
Note that splitting may push one extra key up to the parent node. However,
even the push-up causes the violation of B-tree property, it will be recursively
xed.
Insert implemented in Haskell
Realize the above recursive algorithm in Haskell can implement this insert-xing
program.
258CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
The main program is provided as the following.
insert :: (Ord a) BTree a a BTree a
insert tr x = fixRoot $ ins tr x
It will just call an auxiliary function ins then examine and x the root node
if contains too many keys.
import qualified Data.List as L
--...
ins :: (Ord a) BTree a a BTree a
ins (Node ks [] t) x = Node (L.insert x ks) [] t
ins (Node ks cs t) x = make (ks, cs) (ins c x) (ks, cs)
where
(ks, ks) = L.partition (<x) ks
(cs, (c:cs)) = L.splitAt (length ks) cs
The ins function uses pattern matching to handle the two dierent cases.
If the node to be inserted is leaf, it will call insert function dened in Haskell
standard library, which can insert the new key x into the proper position to
keep the order of the keys.
If the node to be inserted is a branch node, the program will recursively
insert the key to the child which has the range of keys cover x. After that,
it will call make function to combine the result together as a new node. the
examine and xing are performed also by make function.
The function xRoot rst check if the root node contains too many keys,
if it exceeds the limit, splitting will be applied. The split result will be used to
make a new node, so the total height of the tree increases.
fixRoot :: BTree a BTree a
fixRoot (Node [] [tr] _) = tr -- shrink height
fixRoot tr = if full tr then Node [k] [c1, c2] (degree tr)
else tr
where
(c1, k, c2) = split tr
The following is the implementation of make function.
make :: ([a], [BTree a]) BTree a ([a], [BTree a]) BTree a
make (ks, cs) c (ks, cs)
| full c = fixFull (ks, cs) c (ks, cs)
| otherwise = Node (ks++ks) (cs++[c]++cs) (degree c)
While xFull are given like below.
fixFull :: ([a], [BTree a]) BTree a ([a], [BTree a]) BTree a
fixFull (ks, cs) c (ks, cs) = Node (ks++[k]++ks)
(cs++[c1,c2]++cs) (degree c)
where
(c1, k, c2) = split c
In order to print B-tree content out, an auxiliary function toString is pro-
vided to convert a B-tree to string.
toString :: (Show a)BTree a String
toString (Node ks [] _) = "("++(L.intercalate ", " (map show ks))++")"
7.4. INSERTION 259
toString tr = "("++(toStr (keys tr) (children tr))++")" where
toStr (k:ks) (c:cs) = (toString c)++", "++(show k)++", "++(toStr ks cs)
toStr [] [c] = toString c
With all the above denition, the insertion program can be veried with
some simple test cases.
listToBTree::(Ord a)[a]IntBTree a
listToBTree lst t = foldl insert (empty t) lst
testInsert = do
putStrLn $ toString $ listToBTree "GMPXACDEJKNORSTUVYZ" 3
putStrLn $ toString $ listToBTree "GMPXACDEJKNORSTUVYZ" 2
Run testInsert will generate the following result.
((A, C, D, E), G, (J, K), M, (N, O), P, (R, S),
T, (U, V, X, Y, Z))
(((A), C, (D)), E, ((G, J, K), M, (N)), O, ((P),
R, (S), T, (U), V, (X, Y, Z)))
E O
C M R T V
A D G J K N P S U X Y Z
a. Insert result of a 2-3-4 tree (insert-xing method),
G M P T
A C D E J K N O R S U V X Y Z
b. Insert result of a B-tree with minimum degree of 3 (insert-xing method).
Figure 7.6: insert and xing results
Compare the results output by C++ or Python program with this one, as
shown in gure 7.6 we can found that there are dierent points. However, the
B-tree built by Haskell program is still valid because all B-tree properties are
satised. The main reason for this dierence is because of the approaching
change.
Insert implemented in Scheme/Lisp
The main function for insertion in Scheme/Lisp is given as the following.
(define (btree-insert tr x t)
(define (ins tr x)
260CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
(if (leaf? tr)
(ordered-insert tr x) ;;leaf
(let ((res (partition-by tr x))
(left (car res))
(c (cadr res))
(right (caddr res)))
(make-btree left (ins c x) right t))))
(fix-root (ins tr x) t))
The program simply calls an internal function and performs xing on it. The
internal ins function examine if the current node is a leaf node. In case the
node is a leaf, it only contains keys, we can located the position and insert the
new key there. Otherwise, we partition the node into 3 parts, the left part, the
child which the recursive insertion will performed on, and the right part. The
program will do the recursive insertion and then combine these three part to a
new node. xing will be happened during the combination.
Function ordered-insert can help to traverse a ordered list and insert the
new key to proper position as below.
(define (ordered-insert lst x)
(define (insert-by less-p lst x)
(if (null? lst)
(list x)
(if (less-p x (car lst))
(cons x lst)
(cons (car lst) (insert-by less-p (cdr lst) x)))))
(if (string? x)
(insert-by string<? lst x)
(insert-by < lst x)))
In order to deal with B-trees with key types both as string and as number,
we abstract the less-than function as a parameter and pass it to an internal
function.
Function partition-by uses a similar approach.
(define (partition-by tr x)
(define (part-by pred tr x)
(if (= (length tr) 1)
(list () (car tr) ())
(if (pred (cadr tr) x)
(let ((res (part-by pred (cddr tr) x))
(left (car res))
(c (cadr res))
(right (caddr res)))
(list (cons-pair (car tr) (cadr tr) left) c right))
(list () (car tr) (cdr tr)))))
(if (string? x)
(part-by string<? tr x)
(part-by < tr x)))
Where cons-pair is a helper function which can put a key, a child in front
of a B-tree.
(define (cons-pair c k lst)
(cons c (cons k lst)))
7.5. DELETION 261
In order to xing the root of a B-tree, which contains too many keys, a
x-root function is provided.
(define (full? tr t) ;; t: minimum degree
(> (length (keys tr))
(- ( 2 t) 1)))
(define (fix-root tr t)
(cond ((full? tr t) (split tr t))
(else tr)))
When we turn the recursive insertion result to a new node, we need do xing
if the result node contains too many keys.
(define (make-btree l c r t)
(cond ((full? c t) (fix-full l c r t))
(else (append l (cons c r)))))
(define (fix-full l c r t)
(append l (split c t) r))
With all above facilities, we can test the program for verication.
In order to build the B-tree easily from a list of keys, some simple helper
functions are given.
(define (listbtree lst t)
(fold-left (lambda (tr x) (btree-insert tr x t)) () lst))
(define (strslist s)
(if (string-null? s)
()
(cons (string-head s 1) (strslist (string-tail s 1)))))
A same simple test case as the Haskell one is feed to our program.
(define (test-insert)
(listbtree (strslist "GMPXACDEJKNORSTUVYZBFHIQW") 3))
Evaluate test-insert function can get a B-tree.
((("A" "B") "C" ("D" "E" "F") "G" ("H" "I" "J" "K")) "M"
(("N" "O") "P" ("Q" "R" "S") "T" ("U" "V") "W" ("X" "Y" "Z")))
It is as same as the result output by the Haskell program.
7.5 Deletion
Deletion is another basic operation of B-tree. Delete a key from a B-tree may
cause violating of B-tree balance properties, that a node cant contains too few
keys (no less than t 1 keys, where t is minimum degree).
Similar to the approaches for insertion, we can either do some preparation
so that the node from where the key will be deleted contains enough keys; or
do some xing after the deletion if the node has too few keys.
262CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
7.5.1 Merge before delete method
In textbook[2], the delete algorithm is given as algorithm description. The
pseudo code is left as exercises. The description can be used as a good reference
when writing the pseudo code.
Merge before delete algorithm implemented imperatively
The rst case is the trivial, if the key k to be deleted can be located in node x
and x is a leaf node. we can directly remove k from x.
Note that this is a terminal case. For most B-trees which have not only a
leaf node as the root. The program will rst examine non-leaf nodes.
The second case states that, the key k can be located in node x, however, x
isnt a leaf node. In this case, there are 3 sub cases.
If the child node y precedes k contains enough keys (more than t). We
replace k in node x with k
from y.
The predecessor of k can be easily located as the last key of child y.
If y doesnt contains enough keys, while the child node z follows k contains
more than t keys. We replace k in node x with k
from z.
The successor of k can be easily located as the rst key of child z.
Otherwise, if neither y, nor z contains enough keys, we can merge y, k
and z into one new node, so that this new node contains 2t 1 keys. After
that, we can then recursively do the removing.
Note that after merge, if the current node doesnt contain any keys, which
means k is the only key in x, y and z are the only two children of x. we
need shrink the tree height by one.
The case 2 is illustrated as in gure 7.7, 7.8, and 7.9.
Note that although we use recursive way to delete keys in case 2, the recur-
sion can be turned into pure imperative way. Well show such program in C++
implementation.
the last case states that, if k cant be located in node x, the algorithm need
try to nd a child node c
i
of x, so that sub-tree c
i
may contains k. Before the
deletion is recursively applied in c
i
, we need be sure that there are at least t
keys in c
i
. If there are not enough keys, we need do the following adjustment.
We check the two sibling of c
i
, which are c
i1
and c
i+1
. If either one of
them contains enough keys (at least t keys), we move one key from x down
to c
i
, and move one key from the sibling up to x. Also we need move the
relative child from the sibling to c
i
.
This operation makes c
i
contains enough keys OK for deletion. we can
next try to delete k from c
i
recursively.
In case neither one of the two siblings contains enough keys, we then merge
c
i
, a key from x, and either one of the sibling into a new node, and do the
deletion on this new node.
7.5. DELETION 263
Figure 7.7: case 2a. Replace and delete from predecessor.
Figure 7.8: case 2b. Replace and delete from successor.
264CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Figure 7.9: case 2c. Merge and delete.
Figure 7.10: case 3a. Borrow from left sibling.
7.5. DELETION 265
Figure 7.11: case 3b. Borrow Merge and delete.
Case 3 is illustrated in gure 7.10, 7.11.
By implementing the above 3 cases into pseudo code, the B-tree delete al-
gorithm can be given as the following.
First there are some auxiliary functions to do some simple test and operations
on a B-tree.
1: function CAN-DEL(T)
2: return number of keys of T t
Function CAN DEL test if a B-tree node contains enough keys (no less
than t keys).
1: procedure MERGE-CHILDREN(T, i) Merge children i and i + 1
2: x CHILDREN(T)[i]
3: y CHILDREN(T)[i + 1]
4: APPEND(KEY S(x), KEY S(T)[i])
5: CONCAT(KEY S(x), KEY S(y))
6: CONCAT(CHILDREN(x), CHILDREN(y)
7: REMOV E(KEY S(T), i)
8: REMOV E(CHILDREN(T), i + 1)
Procedure MERGECHILDREN merges the i-th child, the i-th key, and
i + 1-th child of node T into a new child, and remove the i-th key and i + 1-th
child after merging.
With these helper functions, the main algorithm of B-tree deletion is de-
scribed as below.
1: function B-TREE-DELETE(T, k)
266CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
2: i 1
3: while i <= LENGTH(KEY S(T)) do
4: if k = KEY S(T)[i] then
5: if T is leaf then case 1
6: REMOV E(KEY S(T), k)
7: else case 2
8: if CAN DEL(CHILDREN(T)[i]) then case 2a
9: KEY S(T)[i] LAST KEY (CHILDREN(T)[i])
10: BTREEDELETE(CHILDREN(T)[i], KEY S(T)[i])
11: else if CAN DEL(CHILDREN(T)[i +1]) then case 2b
12: KEY S(T)[i] FIRST KEY (CHILDREN(T)[i +1])
13: BTREEDELETE(CHILDREN(T)[i+1], KEY S(T)[i])
14: else case 2c
15: MERGE CHILDREN(T, i)
16: B TREE DELETE(CHILDREN(T)[i], k)
17: if KEY S(T) = NIL then
18: T CHILDREN(T)[i] Shrinks height
19: return T
20: else if k < KEY S(T)[i] then
21: BREAK
22: else
23: i i + 1
24: if T is leaf then
25: return T k doesnt exist in T at all
26: if not CAN DEL(CHILDREN(T)[i]) then case 3
27: if i > 1 and CAN DEL(CHILDREN(T)[i 1]) then case 3a:
left sibling
28: INSERT(KEY S(CHILDREN(T)[i]), KEY S(T)[i 1])
29: KEY S(T)[i1] POP BACK(KEY S(CHILDREN(T)[i
1]))
30: if CHILDREN(T)[i] isnt leaf then
31: c POP BACK(CHILDREN(CHILDREN(T)[i 1]))
32: INSERT(CHILDREN(CHILDREN(T)[i]), c)
33: else if i <= LENGTH(CHILDREN(T)) and CANDEL(CHILDREN(T)[i+
1] then case 3a: right sibling
34: APPEND(KEY S(CHILDREN(T)[i]), KEY S(T)[i])
35: KEY S(T)[i] POP FRONT(KEY S(CHILDREN(T)[i +
1]))
36: if CHILDREN(T)[i] isnt leaf then
37: c POPFRONT(CHILDREN(CHILDREN(T)[i+1]))
38: APPEND(CHILDREN(CHILDREN(T)[i]), c)
39: else case 3b
40: if i > 1 then
41: MERGE CHILDREN(T, i 1)
42: else
43: MERGE CHILDREN(T, i)
44: B TREE DELETE(CHILDREN(T)[i], k) recursive delete
45: if KEY S(T) = NIL then Shrinks height
7.5. DELETION 267
46: T CHILDREN(T)[1]
47: return T
Merge before deletion algorithm implemented in C++
The C++ implementation given here isnt simply translate the above pseudo
code into C++. The recursion can be eliminated in a pure imperative program.
In order to simplify some B-tree node operation, some auxiliary member
functions are added to the B-tree node class denition.
template<class K, int t>
struct BTree{
//...
// merge children[i], keys[i], and children[i+1] to one node
void merge_children(int i){
BTree<K, t> x = children[i];
BTree<K, t> y = children[i+1];
xkeys.push_back(keys[i]);
concat(xkeys, ykeys);
concat(xchildren, ychildren);
keys.erase(keys.begin()+i);
children.erase(children.begin()+i+1);
ychildren.clear();
delete y;
}
key_type replace_key(int i, key_type key){
keys[i]=key;
return key;
}
bool can_remove(){ return keys.size() t; }
//...
Function replace key can update the i-th key of a node with a new value.
Typically, this new value is pulled from a child node as described in deletion
algorithm. It will return the new value.
Function can remove will test if a node contains enough keys for further
deletion.
Function merge children can merge the i-th child, the i-th key, and the
i +1-th children into one node. This operation is reverse operation of splitting,
it can double the keys of a node, so that such adjustment can ensure a node has
enough keys for further deleting.
Note that, unlike the other languages equipped with GC, in C++ program,
the memory must be released after merging.
This function uses concat function to concatenate two collections. It is
dened as the following.
template<class Coll>
void concat(Coll& x, Coll& y){
std::copy(y.begin(), y.end(),
std::insert_iterator<Coll>(x, x.end()));
}
268CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
With these helper functions, the main program of B-tree deleting is given as
below.
template<class T>
T del(T tr, typename T::key_type key){
T root(tr);
while(!trleaf()){
unsigned int i = 0;
bool located(false);
while(i < trkeys.size()){
if(key == trkeys[i]){
located = true;
if(trchildren[i]can_remove()){ //case 2a
key = trreplace_key(i, trchildren[i]keys.back());
trchildren[i]keys.pop_back();
tr = trchildren[i];
}
else if(trchildren[i+1]can_remove()){ //case 2b
key = trreplace_key(i, trchildren[i+1]keys.front());
trchildren[i+1]keys.erase(trchildren[i+1]keys.begin());
tr = trchildren[i+1];
}
else{ //case 2c
trmerge_children(i);
if(trkeys.empty()){ //shrinks height
T temp = trchildren[0];
trchildren.clear();
delete tr;
tr = temp;
}
}
break;
}
else if(key > trkeys[i])
i++;
else
break;
}
if(located)
continue;
if(!trchildren[i]can_remove()){ //case 3
if(i>0 && trchildren[i-1]can_remove()){
// case 3a: left sibling
trchildren[i]keys.insert(trchildren[i]keys.begin(),
trkeys[i-1]);
trkeys[i-1] = trchildren[i-1]keys.back();
trchildren[i-1]keys.pop_back();
if(!trchildren[i]leaf()){
trchildren[i]children.insert(trchildren[i]children.begin(),
trchildren[i-1]children.back());
trchildren[i-1]children.pop_back();
}
}
else if(i<trchildren.size() && trchildren[i+1]can_remove()){
// case 3a: right sibling
7.5. DELETION 269
trchildren[i]keys.push_back(trkeys[i]);
trkeys[i] = trchildren[i+1]keys.front();
trchildren[i+1]keys.erase(trchildren[i+1]keys.begin());
if(!trchildren[i]leaf()){
trchildren[i]children.push_back(trchildren[i+1]children.front());
trchildren[i+1]children.erase(trchildren[i+1]children.begin());
}
}
else{
if(i>0)
trmerge_children(i-1);
else
trmerge_children(i);
}
}
tr = trchildren[i];
}
trkeys.erase(remove(trkeys.begin(), trkeys.end(), key),
trkeys.end());
if(rootkeys.empty()){ //shrinks height
T temp = rootchildren[0];
rootchildren.clear();
delete root;
root = temp;
}
return root;
}
Please note how the recursion be eliminated. The main loop terminates only
if the current node which is examined is a leaf. Otherwise, the program will go
through the B-tree along the path which may contains the key to be deleted, and
do proper adjustment including borrowing keys from other nodes, or merging
to make the candidate nodes along this path all have enough keys to perform
deleting.
In order to verify this program, a quick and simple parsing function which
can turn a B-tree description string into a B-tree is provided. Error handling of
parsing is omitted for illusion purpose.
template<class T>
T parse(std::string::iterator& first, std::string::iterator last){
T tr = new T;
++first; //(
while(first!=last){
if(first==(){ //child
trchildren.push_back(parse<T>(first, last));
}
else if(first == , | | first == )
++first; //skip deliminator
else if(first == )){
++first;
return tr;
}
else{ //key
typename T::key_type key;
270CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
while(first!=, && first!=))
key+=first++;
trkeys.push_back(key);
}
}
//should never run here
return 0;
}
template<class T>
T str_to_btree(std::string s){
std::string::iterator first(s.begin());
return parse<T>(first, s.end());
}
After that, the testing can be performed as below.
void test_delete(){
std::cout<<"testdelete...n";
const char s="(((A,B),C,(D,E,F),G,(J,K,L),M,(N,O)),"
"P,((Q,R,S),T,(U,V),X,(Y,Z)))";
typedef BTree<std::string, 3> BTr;
BTr tr = str_to_btree<BTr>(s);
std::cout<<"beforedelete:n"<<btree_to_str(tr)<<"n";
const char ks[] = {"F", "M", "G", "D", "B", "U"};
for(unsigned int i=0; i<sizeof(ks)/sizeof(char); ++i)
tr=__test_del__(tr, ks[i]);
delete tr;
}
template<class T>
T __test_del__(T tr, typename T::key_type key){
std::cout<<"delete"<<key<<"=n";
tr = del(tr, key);
std::cout<<btree_to_str(tr)<<"n";
return tr;
}
Run test delete will generate the below result.
test delete...
before delete:
(((A, B), C, (D, E, F), G, (J, K, L), M, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete F==>
(((A, B), C, (D, E), G, (J, K, L), M, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete M==>
(((A, B), C, (D, E), G, (J, K), L, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete G==>
(((A, B), C, (D, E, J, K), L, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete D==>
((A, B), C, (E, J, K), L, (N, O), P, (Q, R, S), T, (U, V), X, (Y, Z))
delete B==>
((A, C), E, (J, K), L, (N, O), P, (Q, R, S), T, (U, V), X, (Y, Z))
delete U==>
((A, C), E, (J, K), L, (N, O), P, (Q, R), S, (T, V), X, (Y, Z))
7.5. DELETION 271
Figure 7.12, 7.13, and 7.14 show this deleting test process step by step. The
nodes modied are shaded. The rst 5 steps are as same as the example shown
in textbook[2] gure 18.8.
P
C G M T X
A B D E F J K L N O Q R S U V Y Z
a. A B-tree before performing deleting;
P
C G M T X
A B D E J K L N O Q R S U V Y Z
b. After delete key F, case 1;
Figure 7.12: Result of B-tree deleting program (1)
Merge before deletion algorithm implemented in Python
In Python implementation, detailed memory management can be handled by
GC. Similar as the C++ program, some auxiliary member functions are added
to B-tree node denition.
class BTreeNode:
#...
def merge_children(self, i):
#merge children[i] and children[i+1] by pushing keys[i] down
self.children[i].keys += [self.keys[i]]+self.children[i+1].keys
self.children[i].children += self.children[i+1].children
self.keys.pop(i)
self.children.pop(i+1)
def replace_key(self, i, key):
self.keys[i] = key
return key
def can_remove(self):
return len(self.keys) self.t
The member function names are same with the C++ program, so that the
meaning for each of them can be referred in previous sub section.
272CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
P
C G L T X
A B D E J K N O Q R S U V Y Z
c. After delete key M, case 2a;
P
C L T X
A B D E J K N O Q R S U V Y Z
d. After delete key G, case 2c;
Figure 7.13: Result of B-tree deleting program (2)
C L P T X
A B E J K N O Q R S U V Y Z
e. After delete key D, case 3b, and height is shrunk;
E L P T X
A C J K N O Q R S U V Y Z
f. After delete key B, case 3a, borrow from right sibling;
E L P S X
A C J K N O Q R T V Y Z
g. After delete key U, case 3a, borrow from left sibling;
Figure 7.14: Result of B-tree deleting program (3)
7.5. DELETION 273
In contrast to the C++ program, a recursion approach similar to the pseudo
code is used in this Python program.
def B_tree_delete(tr, key):
i = len(tr.keys)
while i>0:
if key == tr.keys[i-1]:
if tr.leaf: # case 1 in CLRS
tr.keys.remove(key)
#disk_write(tr)
else: # case 2 in CLRS
if tr.children[i-1].can_remove(): # case 2a
key = tr.replace_key(i-1, tr.children[i-1].keys[-1])
B_tree_delete(tr.children[i-1], key)
elif tr.children[i].can_remove(): # case 2b
key = tr.replace_key(i-1, tr.children[i].keys[0])
B_tree_delete(tr.children[i], key)
else: # case 2c
tr.merge_children(i-1)
B_tree_delete(tr.children[i-1], key)
if tr.keys==[]: # tree shrinks in height
tr = tr.children[i-1]
return tr
elif key > tr.keys[i-1]:
break
else:
i = i-1
# case 3
if tr.leaf:
return tr #key doesnt exist at all
if not tr.children[i].can_remove():
if i>0 and tr.children[i-1].can_remove(): #left sibling
tr.children[i].keys.insert(0, tr.keys[i-1])
tr.keys[i-1] = tr.children[i-1].keys.pop()
if not tr.children[i].leaf:
tr.children[i].children.insert(0, tr.children[i-1].children.pop())
elif i<len(tr.children) and tr.children[i+1].can_remove(): #right sibling
tr.children[i].keys.append(tr.keys[i])
tr.keys[i]=tr.children[i+1].keys.pop(0)
if not tr.children[i].leaf:
tr.children[i].children.append(tr.children[i+1].children.pop(0))
else: # case 3b
if i>0:
tr.merge_children(i-1)
else:
tr.merge_children(i)
B_tree_delete(tr.children[i], key)
if tr.keys==[]: # tree shrinks in height
tr = tr.children[0]
return tr
In order to verify the deletion program, similar test cases are fed to the
function.
def test_delete():
print "testdelete"
274CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
t = 3
tr = BTreeNode(t, False)
tr.keys=["P"]
tr.children=[BTreeNode(t, False), BTreeNode(t, False)]
tr.children[0].keys=["C", "G", "M"]
tr.children[0].children=[BTreeNode(t), BTreeNode(t), BTreeNode(t), BTreeNode(t)]
tr.children[0].children[0].keys=["A", "B"]
tr.children[0].children[1].keys=["D", "E", "F"]
tr.children[0].children[2].keys=["J", "K", "L"]
tr.children[0].children[3].keys=["N", "O"]
tr.children[1].keys=["T", "X"]
tr.children[1].children=[BTreeNode(t), BTreeNode(t), BTreeNode(t)]
tr.children[1].children[0].keys=["Q", "R", "S"]
tr.children[1].children[1].keys=["U", "V"]
tr.children[1].children[2].keys=["Y", "Z"]
print B_tree_to_str(tr)
lst = ["F", "M", "G", "D", "B", "U"]
reduce(__test_del__, lst, tr)
def __test_del__(tr, key):
print "delete", key
tr = B_tree_delete(tr, key)
print B_tree_to_str(tr)
return tr
In this test case, the B-tree is constructed manually. It is identical to the
B-tree built in C++ deleting test case. Run the test function will generate the
following result.
test delete
(((A, B), C, (D, E, F), G, (J, K, L), M, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete F
(((A, B), C, (D, E), G, (J, K, L), M, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete M
(((A, B), C, (D, E), G, (J, K), L, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete G
(((A, B), C, (D, E, J, K), L, (N, O)), P, ((Q, R, S), T, (U, V), X, (Y, Z)))
delete D
((A, B), C, (E, J, K), L, (N, O), P, (Q, R, S), T, (U, V), X, (Y, Z))
delete B
((A, C), E, (J, K), L, (N, O), P, (Q, R, S), T, (U, V), X, (Y, Z))
delete U
((A, C), E, (J, K), L, (N, O), P, (Q, R), S, (T, V), X, (Y, Z))
This result is as same as the one output by C++ program.
7.5.2 Delete and x method
From previous sub-sections, we see how complex is the deletion algorithm, There
are several cases, and in each case, there are sub cases to deal.
Another approach to design the deleting algorithm is a kind of delete-then-x
way. It is similar to the insert-then-x strategy.
7.5. DELETION 275
When we need delete a key from a B-tree, we rstly try to locate which
node this key is contained. This will be a traverse process from the root node
towards leaves. We start from root node, If the key doesnt exist in the node,
well traverse deeper and deeper until we rich a node.
If this node is a leaf node, we can remove the key directly, and then examine
if the deletion makes the node contains too few keys to maintain the B-tree
balance properties.
If it is a branch node, removing the key will break the node into two parts,
we need merge them together. The merging is a recursive process which can be
shown in gure 7.15.
Figure 7.15: Delete a key from a branch node. Removing k
i
breaks the node into
2 parts, left part and right part. Merging these 2 parts is a recursive process.
When the two parts are leaves, the merging terminates.
When do merging, if the two nodes are not leaves, we merge the keys to-
gether, and recursively merge the last child of the left part and the rst child of
the right part as one new child node. Otherwise, if they are leaves, we merely
put all keys together.
Till now, we do the deleting in straightforward way. However, deleting will
decrease the number of keys of a node, and it may result in violating the B-
tree balance properties. The solution is to perform a xing along the path we
traversed from root.
When we do recursive deletion, the branch node is broken into 3 parts.
The left part contains all keys less than k, say k
1
, k
2
, ..., k
i1
, and children
c
1
, c
2
, ..., c
i1
, the right part contains all keys greater than k, say k
i
, k
i+1
, ..., k
n+1
,
and children c
i+1
, c
i+2
, ..., c
n+1
, the child c
i
which recursive deleting applied be-
comes c
i
. We need make these 3 parts to a new node as shown in gure 7.16.
276CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
Figure 7.16: Denote c
i
as the result of recursively deleting key k, from child c
i
,
we should do xing when making the left part, c
i
and right part together to a
new node.
At this time point, we can examine if c
i
contains enough keys, it the number
of keys is to less (less than t 1, but not t in contrast to merge and delete
approach), we can either borrow a key-child pair from left part or right part,
and do a inverse operation of splitting. Figure 7.17 shows an example of borrow
from left part.
In case both left part and right part is empty, we can simply push c
i
up.
Delete and x algorithm implemented functionally
By summarize all above analysis, we can draft the delete and x algorithm.
1: function B-TREE-DELETE(T, k)
2: return FIX ROOT(DEL(T, k))
3: function DEL(T, k)
4: if CHILDREN(T) = NIL then leaf node
5: DELETE(KEY S(T), k)
6: return T
7: else branch node
8: n LENGTH(KEY S(T))
9: i LOWER BOUND(KEY S(T), k)
10: if KEY S(T)[i] = k then
11: k
l
KEY S(T)[1, ..., i 1]
12: k
r
KEY S(T)[i + 1, ..., n]
13: c
l
CHILDREN(T)[1, ..., i]
14: c
r
CHILDREN(T)[i + 1, ..., n + 1]
15: return MERGE(CREATE B TREE(k
l
, c
l
), CREATE
7.5. DELETION 277
Figure 7.17: Borrow a key-child pair from left part and un-split to a new child.
B TREE(k
r
, c
r
))
16: else
17: k
l
KEY S(T)[1, ..., i 1]
18: k
r
KEY S(T)[i, ..., n]
19: c CHILDREN(T)[i]
20: c
l
CHILDREN(T)[1, ..., i 1]
21: c
r
CHILDREN(T)[i + 1, ..., n + 1]
22: return MAKE((k
l
, c
l
), c, (k
r
, c
r
))
The main delete function will call an internal DEL function to performs the
work, after that, it will apply FIX ROOT to check if need to shrink the tree
height. So the FIXROOT function we dened in insertion section should be
updated as the following.
1: function FIX-ROOT(T)
2: if KEY S(T) = NIL then Single child, shrink the height
3: T CHILDREN(T)[1]
4: else if FULL?(T) then
5: T B TREE SPLIT(T)
6: return T
For the recursive merging, the algorithm is given as below. The left part
and right part are passed as parameters. If they are leaves, we just put all keys
together. Otherwise, we recursively merge the last child of left and the rst
child of right to a new child, and make this new merged child and the other two
parts it breaks into a new node.
1: function MERGE(L, R)
2: if L, R are leaves then
278CHAPTER 7. B-TREES WITH FUNCTIONAL AND IMPERATIVE IMPLEMENTATION
3: T CREATE NEW NODE()
4: KEY S(T) KEY S(L) +KEY S(R)
5: return T<