Text Databases
Gonzalo Navarro
Dept. of Computer Science, University of Chile
A text is any sequence of symbols (or characters) drawn from an alphabet. A large portion
of the information available worldwide in electronic form is actually in text form (other popular
forms are structured and multimedia information). Some examples are: natural language text
(for example books, journals, newspapers, jurisprudence databases, corporate information, and the
Web), biological sequences (DNA and protein sequences), continuous signals (such as audio, video
sequence descriptions and time functions), and so on. Recently, due to the increasing complexity of
applications, text and structured data are being merged into so-called semistructured data, which
is expressed and manipulated in formats such as XML (outside the scope of this review).
As more text data becomes available, the challenge of searching it for queries of interest becomes more
demanding. A text database is a system that maintains a (usually large) text collection and provides
fast and accurate access to it. These two goals are relatively orthogonal, and both are critical to
profit from the text collection.
Traditional database technologies are not well suited to handle text databases. Relational
databases structure the data in fixed-length records whose values are atomic and are searched
for by equality or ranges. There is no general way to partition a text into atomic records, and
treating the whole text as an atomic value is of little interest for the search operations required by
applications. The same problem occurs with multimedia data types. Hence the need for specific
technology to manage texts.
The simplest query to a text database is just a string pattern: the user enters a string and the
system points out all the positions of the collection where the string appears. This simple approach
is insufficient for some applications where the user wishes to find a more general pattern, not just
a plain string. Patterns may contain wild cards (characters that may appear zero or more times),
optional characters (that may appear in the text or not), classes of characters (string positions that
match a set of characters rather than a single one), gaps (that match any text substring within
a range of lengths), etc. In many applications patterns must be regular expressions, composed of
simple strings combined by union, concatenation, or repetition of other regular expressions. These extensions
are typical of computational biology applications, but also appear in natural language search.
In addition, a text database may provide approximate searching, that is, means for recovering
from different kinds of errors that the text collection (or the user query) may contain. We may
have, for example, typing, spelling or optical character recognition errors in natural language texts;
experimental measurement errors or evolutionary differences in biological sequences; and noise or
irrelevant differences in continuous signals. Depending on the application, the text database should
provide a reasonable error model that permits users to specify some tolerance in their searches. A
simple error model is the edit distance, which gives an upper limit to the total number of character
insertions, deletions, and substitutions needed to match the user query with a text substring. More
complex models give lower costs to more probable changes in the sequences.
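The unit-cost edit distance just described is easy to state as a dynamic program; the sketch below (in Python, purely for illustration) computes it with the classical recurrence:

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance: the minimum number of
    character insertions, deletions, and substitutions needed to turn
    a into b, via classical dynamic programming in O(|a||b|) time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))                 # distances from "" to b[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n                   # distance from a[:i] to ""
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # substitute, or match for free
        prev = cur
    return prev[n]
```

The more complex error models mentioned above only change the three unit costs into character-dependent ones; the recurrence stays the same.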
The above searches are collectively called syntactic search, because they are expressed in terms
of sequences of characters present in the text. On natural language texts, semantic search is of great
value. In this case the user expresses an information need and the system is in charge of retrieving
portions of the text collection (documents) that are relevant to that need, even if the query words
do not directly appear in the answer. A model that determines how relevant a document is with
respect to a query is necessary. This model ranks the documents and offers the highest-ranked ones
to the user. Unlike in syntactic search, there are no right or wrong answers, just better and
worse ones. However, since the time the user spends browsing the results is the most valuable
resource, good ranking is essential to the success of the search.
1 Syntactic Search
In syntactic searching, the desired outcome of a search pattern is clear, so the main concern of
a text database is efficiency. Two approaches are possible to pattern matching: sequential and
indexed search (these can be combined, as we will see). Sequential searching assumes that it is
not possible to preprocess the text, so only the pattern is preprocessed and then the whole text
database is sequentially scanned. Indexed searching, on the other hand, preprocesses the text so
as to build a data structure over it (an “index”), which can be used later to speed up searches.
Building an index is usually a time and memory demanding task, so indexed searching is possible
only when several conditions are met: (i) the text must be large enough to justify it against the
simpler sequential search, (ii) the text must change rarely enough so as to amortize the indexing
work over many queries, (iii) there must be enough space to hold the index, which can be quite
large in some applications.
every text character (actually, its search time improves as m grows). Another famous algorithm is
Backward DAWG Matching (BDM), for its average-case optimality, O(n log_σ(m)/m) (where σ is
the alphabet size if we assume a uniformly distributed text). Many other algorithms are relevant
to theorists for their algorithmic features: optimal in space usage, bounded time to access the next
text character, simultaneous worst- and average-case optimality, and so on.
In practice, the most efficient string matching algorithms derive from their theoretically more
appealing variants. Shift-Or can be seen as a derivation of KMP that is twice as fast, and O(n)
in most practical cases. Horspool is a simplification of BM that is usually the fastest for searching
natural language text. Backward Nondeterministic DAWG Matching (BNDM) is a derivation of
BDM that is faster for searching biological sequences of moderate length. Backward Oracle Matching
(BOM) also derives from BDM and is the fastest for longer biological sequences.
The use of finite automata is at the kernel of pattern matching. Many of the above mentioned al-
gorithms rely on automata. Figure 1 shows a nondeterministic finite automaton (NFA) that reaches
its final state every time the string it has consumed finishes with a given pattern P . If fed with the
text, this automaton will signal all the endpoints of occurrences of P in T . Algorithm KMP consists
essentially in making that NFA deterministic (a DFA), and using it to process each text character in
O(1) time. Shift-Or, on the other hand, uses the automaton directly in nondeterministic form. The
NFA states are mapped to bits in a computer word, and a constant number of logical and arithmetic
operations over the computer word simulate the update of all the NFA states simultaneously, as a
function of a new text character. The result is twice as fast as KMP in practice, and O(n) provided
m ≤ w, where w is the number of bits in the computer word (32 or 64 nowadays). For m > w, Shift-Or
is still O(n) on average. This technique to map NFA states to bits is called bit-parallelism, and it
has been extensively used in pattern matching to simulate NFAs directly without converting them
to DFAs.
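A minimal sketch of Shift-Or in Python may make the bit-parallel simulation concrete. Python integers are unbounded, so the m ≤ w restriction of a real machine word does not show up here, but the update rule per text character is exactly the one described above:

```python
def shift_or(text: str, pattern: str):
    """Bit-parallel Shift-Or: bit i of the state word D is 0 iff the
    NFA state 'pattern[:i+1] just matched' is active. One shift, one
    OR, and one AND update all states per text character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # B[c] has a 0 at bit i iff pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)
    D = all_ones                  # all states inactive
    occ = []
    for pos, c in enumerate(text):
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & (1 << (m - 1)) == 0:      # final state active
            occ.append(pos - m + 1)      # start of the occurrence
    return occ
```

For example, `shift_or("abracadabra", "abra")` reports the occurrence starts `[0, 7]`.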
Figure 1: An NFA that reaches its final state whenever the text read so far ends with the pattern "cadabra"; a Σ self-loop at the initial state lets a match start anywhere.
In Figure 2 we see an NFA that recognizes every reversed prefix of a pattern P , that is, a prefix
of P read backwards (a prefix is a substring that starts at the beginning of the string). Note that
the NFA will have active states as long as we have fed it with a reversed substring of P . This
automaton is used by BDM and BNDM, the former by making it deterministic and the latter
with bit-parallelism. The idea is to build such an NFA over the reverse pattern. Given a text
window (segment of length m) where the pattern could appear, we feed the automaton with the
window characters read right to left. If the automaton runs out of active states at some point, it
means that the window suffix we have read is not a substring of P , so P cannot appear in any
window containing the suffix read. Hence, we can shift the window over T to the position following
the window character that killed the automaton. Actually, we can remember the last time the
automaton signaled a prefix of P and shift to align the window to that position. If, on the other
hand, we read the whole window and the automaton still has active states, then P is in the window,
so we report it and shift to align the window to the last prefix recognized. Figure 3 (left) illustrates.
Algorithm BOM is also based on determinizing this NFA, albeit in an imperfect way that yields
shorter shifts but runs faster because it is simpler.
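The backward window scan just described can be sketched as a simplified BNDM implementation, simulating the reversed-prefix NFA with bit-parallelism:

```python
def bndm(text: str, pattern: str):
    """Backward Nondeterministic DAWG Matching (simplified sketch):
    scan each length-m window right to left; bit i of B[c] is set iff
    pattern[i] == c, counted from the right end of the pattern."""
    m, n = len(pattern), len(text)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    occ, pos = [], 0
    while pos <= n - m:
        j, last = m, m
        D = full                        # all NFA states active
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):      # a pattern prefix was recognized
                if j > 0:
                    last = j            # remember it for the shift
                else:
                    occ.append(pos)     # the whole window matched
                    break
            D = (D << 1) & full
        pos += last                     # shift past the dead positions
    return occ
```

When the automaton dies before reading the whole window, `last` still points at the last recognized prefix, so the shift never skips an occurrence.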
Figure 2: An NFA that recognizes every reversed prefix of "cadabra"; ε-transitions from the initial state allow matching to begin at any pattern position.
Figure 3: On the left, the BDM search process in a window. On the right, the same window
processed by Horspool.
The Horspool algorithm is not based on automata. It simply compares the current text window
against P , reports an occurrence in case of equality, and then shifts the window so that its last
character c aligns with the last occurrence of c within P (its final position excluded). This loses
no possible occurrence and, owing to its simplicity, wins on all but small alphabets. Figure 3 (right)
illustrates.
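Horspool's scheme is short enough to sketch in full; the shift for each character is precomputed once from the pattern:

```python
def horspool(text: str, pattern: str):
    """Horspool search: compare the window against the pattern, then
    shift so that the window's last character aligns with its last
    occurrence inside pattern[:-1] (or skip the whole window)."""
    m, n = len(pattern), len(text)
    shift = {}
    for i in range(m - 1):          # last position excluded, so the
        shift[pattern[i]] = m - 1 - i   # shift is always positive
    occ, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            occ.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return occ
```
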
developed. When simulating general regular expressions, however, the same exponential dependency
on m appears. It is possible, using bit-parallelism, to search in O(mn/ log s) time given O(s) space
for the preprocessing.
Figure 4: On top, an NFA to recognize an extended pattern with optional and repeatable characters.
On the bottom, the NFA of a general regular expression.
Figure 5: An NFA to find "cadabra" within edit distance at most 2. Unlabeled arrows are traversed
by Σ and ε.
distance k must contain one of the pieces unaltered. A multipattern search for the pieces followed by
classical verification of areas around the occurrences is very simple and one of the fastest algorithms,
with average complexity O(nk log_σ(m)/m). For small alphabets, the best algorithms in practice are
also optimal on average: O(n(k + log_σ m)/m).
Several of these techniques can be adapted to approximate searching of regular expressions and
extended patterns as well, with good results.
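The filtration idea above, splitting the pattern into k+1 pieces so that any occurrence with at most k errors must contain one piece unaltered (by pigeonhole), can be sketched as follows. For simplicity the pieces are located with Python's `str.find` instead of a proper multipattern algorithm, and verification uses the classical column-wise DP:

```python
def occurs_with_errors(window: str, pattern: str, k: int) -> bool:
    """Sellers' DP: is some substring of window at edit distance at
    most k from pattern? col[i] holds the best distance of pattern[:i]
    against a substring ending at the current window position."""
    m = len(pattern)
    col = list(range(m + 1))          # col[0] stays 0: matches may start anywhere
    for c in window:
        diag = col[0]
        for i in range(1, m + 1):
            tmp = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(col[i] + 1, col[i - 1] + 1, diag + cost)
            diag = tmp
        if col[m] <= k:
            return True
    return False

def approx_contains(text: str, pattern: str, k: int) -> bool:
    """Filter: search the k+1 pieces exactly, then verify a window of
    length m + 2k around each candidate with the DP above."""
    m = len(pattern)
    cuts = [i * m // (k + 1) for i in range(k + 2)]
    for p in range(k + 1):
        piece, off = pattern[cuts[p]:cuts[p + 1]], cuts[p]
        t = text.find(piece)
        while t != -1:
            lo = max(0, t - off - k)
            hi = min(len(text), t - off + m + k)
            if occurs_with_errors(text[lo:hi], pattern, k):
                return True
            t = text.find(piece, t + 1)
    return False
```
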
"c"
"a" 3
"$"
10
Suffix Trie Text
"r"
1 2 3 4 5 6 7 8 9 10 11
"d" 7
a b r a c a d a b r a
"c"
5
9
"b" "$"
"r" "a"
"c"
2 Suffix Array
"a" 6
"c" 4 11 8 1 4 6 2 9 5 7 10 3
"d" 1
"c"
"b" "r" "a"
"$"
8
"$"
11
Figure 6: The suffix trie and suffix array for the text "abracadabra".
just the leaves of the suffix tree or, equivalently, the sequence of pointers to all text suffixes
in lexicographic order. The suffix array needs 4 times the text size, which is more manageable
although still large. Just as all the occurrences of P appear in a single subtree of the suffix tree,
they appear in a single interval of the suffix array, which can be binary searched in O(m log n)
worst-case time. Very recently, compression techniques for suffix arrays have obtained great success,
but the techniques are still immature.
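A toy suffix array, built by plain sorting (far from the efficient constructions used in practice), shows how the occurrences of P form one contiguous, binary-searchable interval:

```python
def suffix_array(text: str):
    """Starting positions of all suffixes, sorted lexicographically
    by the suffixes. Plain sorting for clarity; linear-time
    constructions exist but are much longer."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text: str, sa, pattern: str):
    """All start positions of pattern via two binary searches that
    delimit the interval of suffixes beginning with the pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                            # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                            # leftmost suffix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])
```

On `"abracadabra"` this builds the array shown (in 1-based form) in Figure 6, and `sa_search` for `"abra"` returns the text positions `[0, 7]`.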
Suffix trees and arrays can be searched for complex patterns too [4]. The general idea is to define
an automaton that recognizes the desired pattern, and to backtrack in the suffix tree (or simulate the
backtracking on the array), descending by every branch. The traversal stops either when the automaton
runs out of active states (in which case the subtree is abandoned) or when it reaches a final state
(in which case the subtree leaves are reported). On most regular expressions and approximate
searches, O(n^λ) worst-case time, for some 0 < λ < 1, is achieved.
Natural language texts, at least in most Western languages, have some remarkable features that
make them easier to search [1, 6]. The text can be seen as a sequence of words. The number
of different words grows slowly with the text size. The query is usually a word or a sequence of
words (a phrase). In this case, inverted indexes are a low cost solution that provide very good
performance. An inverted index is just a list of all the different words in the text (the vocabulary),
plus a list of the occurrences of each such word in the text collection. A search for a word reduces
to a vocabulary search (with hashing, for example) plus outputting the list. A search for a complex
pattern that matches a word is solved by sequential scanning over the vocabulary plus outputting the
matching word lists. A phrase search is solved by intersecting the occurrences of the involved words.
An inverted index typically needs 15% to 40% of the space of the text, but several compression
techniques can be successfully applied to both the index and the text [6].
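A minimal in-memory inverted index illustrates word search and phrase search by list intersection. This is a toy sketch under a simplifying assumption: words are just lowercased, whitespace-separated tokens.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: each vocabulary word maps to its list of
    (document number, word position) occurrences."""
    index = defaultdict(list)
    for d, doc in enumerate(docs):
        for p, word in enumerate(doc.lower().split()):
            index[word].append((d, p))
    return index

def word_search(index, word):
    return index.get(word.lower(), [])

def phrase_search(index, phrase):
    """Intersect the occurrence lists of the phrase words, shifting
    positions so that a match requires consecutive words in the same
    document."""
    words = phrase.lower().split()
    result = set(word_search(index, words[0]))
    for i, w in enumerate(words[1:], 1):
        result &= {(d, p - i) for d, p in word_search(index, w)}
    return sorted(result)
```

A complex pattern would instead be run sequentially over the (small) vocabulary, and the matching words' lists merged.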
2 Semantic Search
In semantic searching, the text collection is seen as a set of documents, and the goal is to establish
the relevance of each document with respect to a user query, so as to present the most relevant ones
to the user. The query can be a set of words or phrases, or even another document. As accuracy in
the answers is a central concern, the choice of a model to compute how relevant a document is with
respect to a query is fundamental [1].
2.2 Evaluation
An issue related to relevance computation is how to evaluate the accuracy of the results. The
most popular measures are called precision and recall. Precision indicates which fraction of the
retrieved documents are actually relevant. Recall indicates which fraction of the relevant documents
were actually retrieved. A human judgement is necessary to establish which are the documents really
relevant to the query. Note that it is easy to obtain higher precision by retrieving fewer documents
and higher recall by retrieving more documents, since the ranking just orders the documents but
does not indicate how many to retrieve. Hence the correct measure is a “precision-recall plot”,
where the precision and recall obtained when retrieving more and more documents are drawn in a
two-dimensional plot (one coordinate for precision and another for recall). For example, one can
plot the precision obtained for recall values of 0%, 10%, 20%, . . ., 100%.
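Computing the points of such a plot is straightforward once the ranked answer and the human relevance judgements are at hand:

```python
def precision_recall_points(ranked, relevant):
    """Precision and recall after each document of the ranked answer.
    ranked: document ids in ranking order; relevant: the set of
    documents a human judged relevant to the query."""
    points, hits = [], 0
    for n, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
        points.append((hits / n, hits / len(relevant)))
    return points
```

Interpolating these points at the fixed recall levels 0%, 10%, ..., 100% yields the standard plot.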
2.3 Indexes
Variants of the inverted index are useful for semantic searches as well. These indexes indicate
only the documents where each word appears, together with its number of occurrences there. This
is useful to compute the term frequency f in the vector ranking. Moreover, since only the few
highest-ranked documents will be output, documents with the highest f value will be the most
interesting ones, so the lists of documents where each term appears are sorted by decreasing f
value. Several heuristics
are used to avoid processing long lists of probably irrelevant documents. Although these may alter
the correct outcome of the query, the relevance model is already a guess of what the user wants,
so it is admissible to distort it a bit in exchange for large savings in efficiency. This tradeoff is not
permitted in syntactic searching.
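A simplified sketch of that ranking: each query term contributes its frequency f in a document to the document's score, and only the top-scored documents are returned. The index layout (term mapped to a list of (document, f) pairs, sorted by decreasing f) follows the description above; the tf-only score is an assumption made for brevity, as real systems also weight each term by its rarity in the collection.

```python
import heapq
from collections import defaultdict

def rank(index, query_terms, top=3):
    """Accumulate per-document scores from the query terms' frequency
    lists, then return the top-scored documents. A real engine would
    cut off each list early instead of traversing it fully."""
    score = defaultdict(int)
    for term in query_terms:
        for doc, f in index.get(term, []):
            score[doc] += f
    return heapq.nlargest(top, score.items(), key=lambda kv: kv[1])
```

Because the lists are sorted by decreasing f, a heuristic can stop traversing each list once the remaining f values cannot change the top answers, which is exactly the kind of admissible distortion discussed above.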
Future Trends
Textual databases face several challenges for the upcoming years.
• The increasing size of text collections, which in cases like the Web reach several terabytes.
Maintaining efficiency and accuracy of results under that scenario is much more complex than
for medium sized collections.
• The low quality of the data, with different reasons depending on the application: no quality
control in the Web, experimental errors in DNA sequences, etc. This poses the need for
approximate matching methods well suited to each application.
• The size of the indexes, which especially for general texts is more relevant than the search
speed. Over a large text, such an index requires frequent access to secondary memory, which
is much slower than main memory.
• The need to deal with secondary memory, which is a rather undeveloped feature in some
indexes, especially those for general texts. Some indexes still perform badly in secondary memory.
• The need to cope with changes in the text collection, as text indexes usually require expensive
rebuilding upon text changes.
• The need to provide, for text databases, the same level of data management achieved for
atomic values in traditional databases, with respect to powerful query languages (for example
permitting joins), concurrency, recovery, security, etc.
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[2] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational
Biology. Cambridge University Press, 1997.
[3] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–
88, 2001.
[4] G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string
matching. IEEE Data Engineering Bulletin, 24(4):19–27, 2001.
[5] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings – Practical on-line search
algorithms for texts and biological sequences. Cambridge University Press, 2002.
[6] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Van Nostrand Reinhold, 2nd edition,
1999.
Relevant Terms
Approximate searching: searching permitting some differences between the pattern specifica-
tion and its text occurrences.
Bit-parallelism: a technique to store several values in a single computer word so as to process
them all in one shot.
Edit distance: a measure of string similarity counting the number of character insertions, dele-
tions and substitutions needed to convert one string into the other.
Indexed searching: searching with the help of an index, a data structure previously built on the
text.
Precision and recall: measures of retrieval quality, relating the documents retrieved with those
actually relevant to a query.
Text database: a database system managing a collection of texts.
Sequential searching: searching without preprocessing the text.
Suffix trees and arrays: text indexes that permit fast access to any text substring.
Semantic search: a search on natural language text based on the meaning rather than the syntax
of the text.
Syntactic search: a text search based on string patterns that appear in the text, without reference
to meaning.
Vector model: a model to measure the relevance between a document and a query, based on
sharing distinctive words.