Indexing: Basic Concepts
Indexing is an arrangement of index terms to
permit fast searching and reducing memory space
requirement
used to speed up access to desired information
from document collection as per users query such
that
It enhances efficiency in terms of time for
retrieval. Relevant documents are searched and
retrieved quick
Index file usually has index terms in a sorted
order. Which list is easier to search?
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
An index file consists of records, called index
entries. Index files are much smaller than the
original file.
Remember Heaps Law: in 1 GB of text
collection the vocabulary has a size of only 5
MB. This size may be further reduced by
Linguistic pre-processing (or text
operations).
The usual unit for indexing is the word
Index terms - are used to look up records in
a file.
Major Steps in Index Construction
Source file: Collection of text document
A document can be described by a set of
representative keywords called index terms.
Index Terms Selection: apply text operations or
preprocessing
Tokenize: identify words in a document, so that each
document is represented by a list of keywords or
attributes
Stop words removal: words with high frequency are
non-content bearing and needs to be removed from
text collection
… continued
Word stem: reduce words with similar meaning
into their stem/root word
Term relevance weight: Different index terms
have varying relevance when used to describe
document contents. This effect is captured
through the assignment of numerical weights
to each index term of a document. There are
different index terms weighting methods:
including TF, IDF,TF*IDF, …
Indexing structure: a set of index terms
(vocabulary) are organized in Index File to easily
identify documents in which each term occurs in.
Basic Indexing Process
Documents to
be indexed. Friends, Romans, countrymen.
Token Tokenizer
stream. Friends Romans countrymen
Modified Linguistic friend roman countryman
tokens. preprocessor
Index File Indexer
friend 2 4
roman 1 2
Inverted file countryman 13 16
Index file Evaluation Metrics
Running time of the main operations
Access/search time
How much is the running time to find the required search
key from the list?
Update time (Insertion time, Deletion time)
How much time does it take to update existing records in
an attempt to add new terms or delete existing
unnecessary terms?
Does the indexing structure allows incremental update or
re-indexing?
Space overhead
Computer storage space consumed for keeping the list.
Building Index file
An index file of a document is a file consisting of a list of
index terms and a link to one or more documents that has
the index term
An index file is list of search terms that are organized for
associative look-up, i.e., to answer user’s query:
In which documents does a specified search term appear?
Where within each document does each term appear? (There may be
several occurrences.)
For organizing index file for a collection of documents,
there are various options available:
Decide what data structure and/or file structure to use. Is it
sequential file, inverted file, suffix tree, etc. ?
Sequential File
•Sequential file is the most primitive file structures.
• It has no vocabulary as well as linking pointers.
•The records are generally arranged serially, one after
another, but in lexicographic order on the value of
some key field. i.e
• a particular attribute is chosen as primary key whose value
will determine the order of the records.
• when the first key fails to discriminate among records, a
second key is chosen to give an order.
Example:
Given a collection of documents, they are parsed to
extract words and these are saved with the
Document ID.
I did enact Julius
Doc 1 Caesar I was killed
I the Capitol;
Brutus killed me.
So let it be with
Doc 2 Caesar. The noble
Brutus has told you
Caesar was ambitious
Sorting the Vocabulary
Term Doc #
I 1
After all documents did
enact
1
1
Sequential file
have been julius 1
Doc
caesar 1
tokenized, I 1 Term No.
stopwords are was
killed
1
1 1 ambition 2
removed, and I 1
2 brutus 1
the 1
normalization and capitol 1
3 brutus 2
stemming are brutus 1
killed 1 4 capitol 1
applied, to generate me 1
so 2 5 caesar 1
index terms let 2
6 caesar 2
These index terms it
be
2
2 7 caesar 2
in sequential file are with
caesar
2
2 8 enact 1
sorted in the 2
9 julius 1
alphabetical order noble
brutus
2
2 10 kill 1
hath 2
told 2 11 kill 1
you 2
caesar 2 12 noble 2
was 2
ambitious 2
Sequential File
To access records search serially;
starting at the first record read and investigate all the
succeeding records until the required record is found
or end of the file is reached.
Its main advantages:
easy to implement;
provides fast access to the next record using
lexicographic order.
Can be searched quickly, using binary search, O(log n)
…continued
Update options: Is the index needs to be rebuilt or
incremental update is supported?
Its disadvantages:
No weights attached to terms.
Random access is slow: since similar terms are
indexed individually, we need to find all terms
that match with the query
Inverted file
A word oriented indexing mechanism based on sorted list
of keywords, with each keyword having links to the
documents containing it
Building and maintaining an inverted index is a relatively low cost
risk. On a text of n words an inverted index can be built in O(n)
time
This list is inverted from a list of terms in location order to a list of
terms in alphabetical order.
Word Extraction Word IDs
Original
Documents •W1:d1,d2,d3
•W2:d2,d4,d7,d9
•…
•Wn :di,…dn
Document IDs
•Inverted Files
Example
Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
Inverted file
Vocabulary Occurrences
beautiful 70
flowers 45, 58
garden 18, 29
house 6
• Space Requirements(for inverted file)
• The space required for the vocabulary is rather
small. According to Heaps’ law the vocabulary
grows as O(n), where is a constant between 0.4
and 0.6 in practice (sublinear)
• On the other hand, the occurrences demand much
more space. Since each word appearing in the text
is referenced once in that structure, the extra space
is O(n)
• To reduce space requirements, a technique called
block addressing is used
•Block Addressing
• The text is divided in blocks
• The occurrences point to the blocks where
the word appears
• Advantages:
– the number of pointers is smaller than positions
– all the occurrences of a word inside a single
block are collapsed to one reference
• Disadvantages:
– online search over the qualifying blocks if exact
positions are required
•Example
• Text:
Block 1 Block 2 Block 3 Block 4
That house has a garden. The garden has many flowers. The flowers are beautiful
• Inverted file:
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
Inverted file
Data to be held in the inverted file includes
The vocabulary (List of terms): is the set of all distinct
words (index terms) in the text collection.
Having information about vocabulary (list of terms) speeds
searching for relevant documents
For each term: the inverted file contains information
related to
Location: all the text locations/positions where the word occurs
frequency of occurrence of terms in a document collection
TFij, number of occurrences of term tj in document di
DFj, number of documents containing tj
CF, total frequency of tj in the corpus n
mi, maximum frequency of any term in di
n, total number of documents in a collection
….
Inverted file
Having information about the location of each term within
the document helps for:
user interface design: highlight location of search term
proximity based ranking: adjacency and near operators (in
Boolean searching)
Having information about frequency is used for:
calculatingterm weighting (like TF, TF*IDF, …)
optimizing query processing
Inverted File
Documents are organized by the terms/words they contain
Term CF Doc ID TF Location
term 1 3 2 1 66
This is called
19 1 213 an index file.
29 1 45
term 2 4 3 1 94
19 2 7, 212 Text operations
are performed
22 1 56
before building
term 3 1 5 1 43 the index.
term 4 3 11 2 3, 70
34 1 40
Is it possible to keep all these information during searching?
Construction of Inverted file
An inverted index consists of two files: vocabulary and
posting files
A vocabulary file (Word list):
stores all of the distinct terms (keywords) that appear in any of
the documents (in lexicographical order) and
For each word a pointer to posting file
Records kept for each term j in the word list contains the
following:
term j
number of documents in which term j occurs (DFj)
Collection frequency of term j (Cf)
pointer to inverted (postings) list for term j
Postings File (Inverted List)
For each distinct term in the vocabulary, stores a list of
pointers to the documents that contain that term.
Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document
Each list consists of one or many individual postings
Advantage of dividing inverted file:
Keeping a pointer in the vocabulary to the list in the posting
file allows:
the vocabulary to be kept in memory at search time even for
large text collection, while the Posting file is kept on disk for
accessing to documents
General structure of Inverted File
The following figure shows the general structure of
inverted index file.
Organization of Index File
Vocabulary
Postings
(word list) Documents
(inverted list)
Pointer
Term DF TF To
posting
term 1 3 3 Inverted
term 2 3 4 lists
term 3 1 1
term 4 2 3
Example:
Given a collection of documents, they are parsed to
extract words and these are saved with the
Document ID.
I did enact Julius
Doc 1 Caesar I was killed
I the Capitol;
Brutus killed me.
So let it be with
Doc 2 Caesar. The noble
Brutus has told you
Caesar was ambitious
Sorting the Vocabulary Term Doc #
Term Doc # ambitious 2
I 1 be 2
After all documents did 1 brutus 1
enact 1 brutus 2
have been tokenized julius 1 capitol 1
the inverted file is caesar
I
1
1
caesar
caesar
1
2
sorted by terms was
killed
1
1
caesar 2
did 1
I 1 enact 1
the 1 has 1
capitol 1 I 1
brutus 1 I 1
killed 1 I 1
me 1 it 2
so 2
julius 1
let 2
killed 1
it 2
killed 1
be 2
let 2
with 2
caesar 2 me 1
the 2 noble 2
noble 2 so 2
brutus 2 the 1
hath 2 the 2
told 2 told 2
you 2 you 2
caesar 2 was 1
was 2 was 2
ambitious 2 with 2
Remove stop words, stemming & compute
frequency
Multiple term Term Doc #
entries in a ambition 2 Term Doc # TF
single ambition 2 1
brutus 1
document are brutus 1 1
brutus 2
merged and brutus 2 1
capitol 1
frequency capitol 1 1
caesar 1
information
caesar 2 caesar 1 1
added
caesar 2 caesar 2 2
enact 1 enact 1 1
julius 1 julius 1 1
kill 1 kill 1 2
kill 1 noble 2 1
noble 2
Vocabulary and postings file
The file is commonly split into a Dictionary and a Posting file
vocabulary posting
Term Doc # TF Term DF CF Doc # TF
ambition 2 1 2 1
ambitious 1 1
brutus 1 1 1 1
brutus 2 1 brutus 2 2
2 1
capitol 1 1 capitol 1 1
1 1
caesar 1 1 caesar 2 3 1 1
caesar 2 2 enact 1 1 2 2
enact 1 1 julius 1 1 1 1
julius 1 1 kill 1 2 1 1
kill 1 2 1 2
noble 1 1
noble 2 1 2 1
Pointers
Searching on Inverted File
Since the whole index file is divided into two,
searching can be done faster by loading vocabulary list
which takes less memory even for large document
collection
Using binary Search the searching takes logarithmic
time
The search is in the vocabulary lists
Updating inverted file is complex.
We need to update both vocabulary and posting files
Example: Create Inverted file
Map the file names to file IDs
Consider the following Original Documents
D1 The Department of Computer Science was established in
1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.
Suffix tree
What is Suffix? A suffix is a substring that exists at the end of the
given string.
Each position in the text is considered as a text suffix
If txt=t1t2...ti...tn is a string, then Ti=ti, ti+1...tn is the suffix of txt that starts at
position i, where 1≤ i ≤ n
Example: txt = mississippi txt = GOOGOL
T1 = mississippi; T1 = GOOGOL
T2 = ississippi; T2 = OOGOL
T3 = ssissippi; T3 = OGOL
T4 = sissippi; T4 = GOL
T5 = issippi; T5 = OL
T6 = ssippi; T6 = L
T7 = sippi;
T8 = ippi; Exercise: generate suffix of “technology” ?
T9 = ppi;
T10 = pi;
T11 = i;
Suffix trie
•A suffix trie is an ordinary trie in which the input
strings are all possible suffixes.
–Principles: The idea behind suffix TRIE is to assign to each
symbol in a text an index corresponding to its position in the
text. (i.e: First symbol has index 1, last symbol has index n
(#of symbols in text).
• To build the suffix TRIE we use these indices instead of the
actual object.
•The structure has several advantages:
–It requires less storage space.
–We do not have to worry how the text is represented (binary,
ASCII, etc).
–We do not have to store the same object twice (no
duplicate).
Suffix Trie
•Construct suffix trie for the following string: GOOGOL
•We begin by giving a position to every suffix in the text starting
from left to right as per characters occurrence in the string.
• TEXT: G O O G O L $
POSITION: 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: The resulting tree has n leaves and height n.
• This structure is
particularly useful
for any application
requiring prefix
based ("starts
with") pattern
matching.
Suffix tree
A suffix tree is an extension of
suffix trie that construct a Trie of
all the proper suffixes of S
The suffix tree is created by
compacting unary nodes of
the suffix TRIE.
We store pointers rather
than words in the leaves.
It is also possible to replace
strings in every edge by a pair
(a,b), where a & b are the
beginning and end index of
the string. i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example
Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
Suffix Trie
70
‘b’ ‘ ’ 58
‘l’ ‘o’ ‘w’ ‘e’ ‘r’ ‘s’
‘f’ 45
‘g’ ‘.’
‘e’‘n’ ‘ ’ 29
‘a’ ‘r’ ‘d’
h’ 18
6 ‘.’
Example
Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
Suffix Tree
70
‘b’ 8 ‘.’ 45
1 ‘f’ 58
‘g’ ‘ ’
‘.’ 18
‘h’ 7
‘’ 29
6
Suffix Arrays
An array containing all the pointers to the text
suffixes listed in lexicographical order
The space requirements are almost the same as those for
inverted indices
The main drawbacks of suffix array are its costly
construction process
Allow binary searches done by comparing the
contents of each pointer
Supra-indices (for large suffix array)
The space requirements of suffix array with vocabulary
supra-index are exactly the same as for inverted indices
Example
Text:
1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful
Vocabulary Supra-Index
beautiful flower garden house
Suffix Array
70 58 45 29 18 6
Inverted List
70 45 58 18 29 6
Example: Suffix tree
•Let s=abab, a suffix tree of s is a
compressed trie of all suffixes of s=abab$
•We label each leaf with the
starting point of the
•{ corresponding suffix.
• $ •$
• b$ •ab
•b •5
• ab$ •$
• bab$
• bab$ } •$ •ab$ •4
•ab$
•3
•2
•1
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a
compressed trie of all suffixes of s S
•To make suffixes prefix-free we add a special char, $, at the end of
s. To associate each suffix with a unique string in S add a different
special symbol to each s
• Build a suffix tree for the string s1$s2#, where `$' and `#'
are a special terminator for s1,s2.
•Ex.: Let s1=abab & s2=aab, a generalized suffix tree for s1 & s2 is:
•{ •a •$ •#
• $ # •b
•# •5 •4
• b$ b#
•b
• ab$ ab# •ab$ •ab$ •$
• bab$ aab# •3
•ab$ •# •4
• abab$ •$ •1
•2
•} •1 •3 •2
Search in suffix tree
Searching for all instances of a substring S in a suffix tree is easy
since any substring of S is the prefix of some suffix.
Pseudo-code for searching in suffix tree:
Start at root
Go down the tree by taking each time the corresponding path
If S correspond to a node, then return all leaves in sub-tree
the places where S can be found are given by the pointers in all the leaves
in the subtree rooted at x.
If S encountered a NIL pointer before reaching the end, then S
is not in the tree
Example:
If S = "GO" we take the GO path and return:
GOOGOL$,GOL$.
If S = "OR" we take the O path and then we hit a NIL pointer so
"OR" is not in the tree.
Exercise
Given the following index terms:
worker, word, world, run & information
construct index file using suffix tree?
Suffix Tree Applications
Suffix Tree can be used to solve a large number of string
problems that occur in:
text-editing,
free-text search,
etc.
Some examples of string problems are given below.
String matching
Longest Common Substring
Longest Repeated Substring
Palindromes
etc..
Complexity Analysis
The suffix tree for a string has been built in O(n2)
time.
Searching is very fast: The search time is linear in the
length of string S.
The number of leaves is n+1, where n is the number of input
strings.
Furthermore, in the leaves, we may store either the strings
themselves or pointers to the strings (that is, integers).
Searching for a substring[1..m], in string[1..n], can be
solved in O(m) time.
Expensive memory-wise because Suffix trees consume
a lot of space
Signature Files
Characteristics
Word-oriented index structures based on hashing
Low overhead (10%~20% over the text size) at the cost of
forcing a sequential search over the index
Suitable for not very large texts
Inverted files outperform signature files for most
applications
Construction and Search
Word-oriented index structures base on hashing
Maps words to bit masks of B bits
Divides the text in blocks of b words each
The mask is obtained by bitwise ORing the signatures of
all the words in the text block.
Search
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the word
For all candidate blocks, an online traversal must be
performed to verify if the word is actually there