DYNAMIC INDEXING
By
Viswada Sripathi
Graduate College
University of Nevada, Las Vegas
December 2010
ABSTRACT
Dynamic Indexing
by
Viswada Sripathi
ACKNOWLEDGEMENTS
I would like to thank my committee chair, Dr. Kazem Taghva, for all his support and guidance, and for his help and invaluable support throughout my master's program. I would also like to thank Darren Paulson, Ms. Donna Ralston and Ms. Sonia Taylor for all their help and support during my work under them, and to thank everyone else who offered their help. Finally, I thank my brother, cousins and all my friends and family members for always being there for me through all phases of my work, for their encouragement and patience, and for giving me their invaluable love and support, without which this work would not have been possible.
CHAPTER 1
INTRODUCTION
Information retrieval is concerned with searching, storing and capturing the required information [1]. However, the field of information retrieval is chiefly concerned with satisfying the information needs of the users: for a given query, documents are retrieved which consist of some or all of the query terms present in the document [3]. The retrieved documents are then ranked, so that documents which match the requested item partially or completely provide the most accurate results first.
In the typical information retrieval process of Figure 1.1 [5], the user gives a query and the contents of the query are searched. In the documents collected, stop words are removed and stemming is applied, and an index is then built from the remaining terms.
Several models have been proposed to retrieve information. The three most commonly used retrieval models are the vector space model, the probabilistic model, and the inference network model.

Vector Space Model. The success or failure of the vector space model depends on the weighting of the index terms or any other keywords used to identify the contents of a text. This model represents documents and queries as vectors of term weights, and one way to compare the vectors is to use the size of the angle between them. This angle can be measured using the cosine rule. The angle between two vectors would be zero, i.e. θ = 0, if the two vectors are identical, which implies cos θ = 1.
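As a concrete illustration, here is a small Python sketch (not from the thesis) that computes the cosine between two term-weight vectors; the example vectors and weights are made up.

    import math

    def cosine(u, v):
        """Cosine of the angle between two term-weight vectors (dicts: term -> weight)."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v)

    doc = {"box": 1.0, "pen": 2.0}
    print(cosine(doc, doc))                          # identical vectors: cos(theta) = 1.0
    print(cosine(doc, {"box": 1.0, "pencil": 2.0}))  # partial overlap: 0 < cos(theta) < 1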
Probabilistic Model. The first probabilistic retrieval model appeared around 1960, and later many different probabilistic models were proposed, all of which rank documents by their estimated probability of relevance to the query. In other words, this model answers the question: what is the probability that a given document is relevant to the user's query?
The Inference Network (IN) model has the ability to perform ranking using a Bayesian network that connects the documents, their contents, and the query. The Document Network (DN) and the Query Network (QN) are the two sub-networks in the IN model. During indexing, the node for the document being indexed is set to 1 and the other document nodes are initialized to 0. This process is applied for every document in the collection. Finally, the final node I is used to produce the ranking and also the belief score for each document.

The rest of this thesis is organized as follows. First we discuss search engines and their role in everyday life. Next we discuss indexing and its types. Finally we present dynamic indexing and how it handles changing collections.
CHAPTER 2
SEARCH ENGINES
With the invention of computers it has become much easier for users to find the information they need. In recent years, World Wide Web search engines have become a vital part of everyday life. The data a search engine gathers may be images, web pages, and other types of files like media files. Once the data has been gathered, the search engines construct lexicons and indexes. When a user enters a query into the search engine, the user expects results that match the given query. The most commonly used search engines are Google, Yahoo, and AltaVista. Search engines use a 'spider' or 'crawler' to fetch the list of pages to be indexed; a crawler is a software agent which reads each and every site. Later the data for each web page is stored in an index. The purpose of an index is to enable fast and accurate retrieval: when you enter a query and search it in the search engine, you are actually searching the index rather than the live Web. Sometimes when a query is searched we get some dead links in the search results. This is because the index might have been created when those links were still working and might not have been updated after that, so it still points to pages that no longer exist.
The measure used to compare the document vectors with the query vector is known as the cosine rule. The cosine rule for ranking can be written as:

    Cosine(Q, Dd) = (1 / (Wq · Wd)) · Σ(t ∈ Q ∩ Dd) wq,t · wd,t

Here, wq,t represents the query term weights and wd,t represents the document term weights, while Wq and Wd are the lengths (norms) of the query and document vectors. There are several algorithms to weigh these terms, and which one to choose depends on the application; we then use the cosine similarity measure to rank the documents [11]. Term weights can be calculated in different ways. Here we use an example collection of five documents whose terms are book, pencil, pen, flower, ribbon and box, and calculate for it the values wt, fd,t, rd,t, wd,t, wq,t and Wd. In this example the total number of documents is five, and the document frequency ft of each term is:

    Term:  book  pencil  pen  flower  ribbon  box
    ft:      3      4      2     3       3      1
The table below shows the cosine similarity measure, i.e. Cosine(Q, Dd), for two queries {box} and {pencil, box} over the collection of documents. Here the Wq values calculated for the two queries are Wq = 1.79 for {box} and Wq = 2.18 for {pencil, box}. The ranking is simple for the single query term {box}, as it appears only once in the document collection. For the second query we need to calculate the cosine similarity measure for both terms in the query and combine them into a single ranking, so the documents are sorted in the descending order of their cosine measure [10]. For the query {box}, the top ranked document would be Document 2. Similarly, for the query {pencil, box} the order of ranking would follow the descending cosine values of the documents containing either term.
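To make the ranking concrete, the Python sketch below scores the five-document collection using its postings (shown in the next section) under an assumed weighting wt = ln(1 + N/ft) with within-document frequencies taken as 1. The thesis tables may use a different weighting, so the exact numbers can differ, but note that wq,box = ln 6 ≈ 1.79 matches the Wq value above, and Document 2 ranks first for {box} as stated. The helper names weight() and rank() are illustrative.

    import math

    # Postings for the five-document example collection (term -> documents containing it).
    postings = {
        "book":   {1, 2, 5},
        "box":    {2},
        "flower": {2, 4, 5},
        "pen":    {1, 4},
        "pencil": {1, 3, 4, 5},
        "ribbon": {2, 3, 5},
    }
    N = 5  # number of documents

    def weight(term):
        """Assumed term weight w_t = ln(1 + N / f_t), where f_t is the document frequency."""
        return math.log(1 + N / len(postings[term]))

    def rank(query):
        """Accumulate cosine contributions w_q,t * w_d,t per document, then normalize."""
        scores = {}
        for t in query:
            for d in postings[t]:
                scores[d] = scores.get(d, 0.0) + weight(t) ** 2  # assume w_q,t = w_d,t
        lengths = {d: math.sqrt(sum(weight(t) ** 2 for t in postings if d in postings[t]))
                   for d in scores}
        wq = math.sqrt(sum(weight(t) ** 2 for t in query))
        return sorted(((s / (lengths[d] * wq), d) for d, s in scores.items()), reverse=True)

    print(rank({"box"}))           # Document 2 ranks first, as in the text
    print(rank({"pencil", "box"}))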
An index enables fast retrieval; without it the search engine would have to scan all the documents in the collection for every query. Building the index begins with tokenization, the process where sentences are broken into words known as tokens [12]. The example below uses a pair of sentences: 'The box consists of toys.' 'So, take it.' After performing tokenization these sentences are chopped into a list of tokens 'The', 'box', 'consists', 'of', 'toys', 'So', 'take', 'it', as shown in figure 2.1.2.
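A minimal Python sketch of this step (a regular-expression tokenizer; real tokenizers handle many more cases):

    import re

    text = "The box consists of toys. So, take it."
    # Keep runs of word characters, discarding punctuation, to produce the token list.
    tokens = re.findall(r"\w+", text)
    print(tokens)  # ['The', 'box', 'consists', 'of', 'toys', 'So', 'take', 'it']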
In the next step, stop words are removed from the output of tokenization. Stop words are frequently occurring words that are not searchable. These words include 'the', 'a', 'is', 'of', 'be', 'as', 'and', 'has', etc. As these words are not necessary, search engines do not record these extremely common words, which saves space in the index and speeds up searches. The figure below shows the elimination of stop words from the token list. Finally, stemming reduces words to their base or root form [13]. Stemming algorithms strip suffixes so that different forms of a word are indexed under a single stem.
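Continuing the sketch, here is stop-word removal followed by a toy suffix-stripping stemmer; the stemmer and the extra stop words 'so' and 'it' are illustrative stand-ins for a real algorithm such as Porter's.

    # Continuing from the token list above: remove stop words, then apply a toy
    # suffix-stripping stemmer (a real system would use e.g. the Porter stemmer).
    STOP_WORDS = {"the", "a", "is", "of", "be", "as", "and", "has", "so", "it"}

    def stem(word):
        for suffix in ("ing", "sses", "ies", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    tokens = ['The', 'box', 'consists', 'of', 'toys', 'So', 'take', 'it']
    terms = [stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]
    print(terms)  # ['box', 'consist', 'toy', 'take']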
Once these preprocessing steps are completed, the document collection can be indexed. The table below [1] shows an inverted index for the collection of documents. To build the index, first the terms are sorted in alphabetical order. In the next step the postings list for the first term, i.e. 'book', is stored in memory, and the same is done for each remaining term. To answer a query, the corresponding postings lists are intersected in memory: the final result must be the list of documents which contain all the terms in the query. For example, for the query 'pen, flower' the result will be Document 4. We can say that the inverted index makes such queries fast because only the relevant postings lists are examined.
    Term     Document frequency (ft)   Postings (d)
    book             3                 1, 2, 5
    box              1                 2
    flower           3                 2, 4, 5
    pen              2                 1, 4
    pencil           4                 1, 3, 4, 5
    ribbon           3                 2, 3, 5
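The lookup just described can be sketched in Python; the index below is copied from the table, and the conjunctive query intersects the postings lists, shortest first.

    # Inverted index from the table above: term -> sorted list of document numbers.
    index = {
        "book":   [1, 2, 5],
        "box":    [2],
        "flower": [2, 4, 5],
        "pen":    [1, 4],
        "pencil": [1, 3, 4, 5],
        "ribbon": [2, 3, 5],
    }

    def conjunctive_query(terms):
        """Return documents containing ALL query terms, by intersecting postings lists."""
        lists = sorted((index.get(t, []) for t in terms), key=len)  # shortest list first
        result = set(lists[0])
        for postings in lists[1:]:
            result &= set(postings)
        return sorted(result)

    print(conjunctive_query(["pen", "flower"]))  # [4] -> Document 4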
CHAPTER 3
INDEXING
Indexing makes it possible to explore the information in very large document collections like the World Wide Web. The collected data is stored in an index to provide high-speed and exact information retrieval [14]. The indexer is the module that builds this structure. The examples and algorithms discussed in this chapter are taken from [15].

Data in memory is accessed much faster than data on disk. The time taken by a disk head to relocate to the place where the data is located is known as seek time [14]. No data can be transferred from disk during the positioning of the disk head; therefore it is much faster to transfer a few large chunks of data than many small ones. Disk I/O is block based, as we are reading and writing entire blocks; here the size of the blocks is 8 KB to 256 KB. IR systems use servers with gigabytes of main memory, and many inexpensive standard machines are used as they are much cheaper than fault-tolerant hardware. The table below shows hardware assumptions [14].
    p    low-level operation (e.g., compare and swap a word)    0.01 µs = 10⁻⁸ s
Whatever the size of the collection, to construct an index the same procedure is used. Inversion is a familiar operation: the collection is transposed so that, for each term, we can list the documents in which it appears. The table below shows an example collection of six documents.

    Document   Text
    1          Pease porridge hot, pease porridge cold,
    2          Pease porridge in the pot,
    3          Nine days old.
    4          Some like it hot, some like it cold,
    5          Some like it in the pot,
    6          Nine days old.

In the table above each line indicates a document. The text in each of the documents contains index terms, and each index term appears in one or more documents. The collection can be represented as a frequency matrix where each row corresponds to one document and each column corresponds to one word. The table below shows the frequency matrix for the given collection of six documents with all the terms and the document numbers. Here a dash indicates that the term does not occur in the document.
    Doc  cold  days  hot  in  it  like  nine  old  pease  porridge  pot  some  the
    1     1     -     1   -   -    -     -     -     2       2       -    -     -
    2     -     -     -   1   -    -     -     -     1       1       1    -     1
    3     -     1     -   -   -    -     1     1     -       -       -    -     -
    4     1     -     1   -   2    2     -     -     -       -       -    2     -
    5     -     -     -   1   1    1     -     -     -       -       1    1     1
    6     -     1     -   -   -    -     1     1     -       -       -    -     -
    Number  Term       Document: 1  2  3  4  5  6
    1       cold                 1  -  -  1  -  -
    2       days                 -  -  1  -  -  1
    3       hot                  1  -  -  1  -  -
    4       in                   -  1  -  -  1  -
    5       it                   -  -  -  2  1  -
    6       like                 -  -  -  2  1  -
    7       nine                 -  -  1  -  -  1
    8       old                  -  -  1  -  -  1
    9       pease                2  1  -  -  -  -
    10      porridge             2  1  -  -  -  -
    11      pot                  -  1  -  -  1  -
    12      some                 -  -  -  2  1  -
    13      the                  -  1  -  -  1  -
To create an index the matrix must be transposed, i.e. inverted, to form a new version in which the rows are the terms. The inverted file can be produced in two steps: in the first step read the text in document order, filling the matrix one document at a time, and in the next step write the matrix to disk row by row in the order of the terms.

The table above shows the transposed matrix with the term numbers and the document numbers; the values indicate the number of times each term occurs in each document. Here the document collection consists of thirteen distinct words and six documents. The in-memory inversion algorithm is:

    1. /* Initialization */
       For each document d and each term t, set f[d, t] ← 0.
    2. /* Counting */
       Read the collection; for each term t appearing in document d,
       set f[d, t] ← f[d, t] + 1.
    3. /* Output */
       For each term t:
       a. Start a new inverted file entry.
       b. For each document d, if f[d, t] > 0 then add <d, f[d, t]> to the entry.
       c. Append the entry to the inverted file.
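A Python sketch of this in-memory inversion on the example collection; the frequency matrix is stored sparsely as a dictionary rather than a full array.

    from collections import defaultdict

    docs = {
        1: "pease porridge hot pease porridge cold",
        2: "pease porridge in the pot",
        3: "nine days old",
        4: "some like it hot some like it cold",
        5: "some like it in the pot",
        6: "nine days old",
    }

    # Steps 1 and 2: build the frequency matrix f[d, t] in a single pass over the text.
    f = defaultdict(int)
    for d, text in docs.items():
        for t in text.split():
            f[d, t] += 1

    # Step 3: for each term, emit the inverted file entry of <d, f[d, t]> pairs.
    inverted = defaultdict(list)
    for d, t in sorted(f):
        inverted[t].append((d, f[d, t]))

    print(inverted["pease"])  # [(1, 2), (2, 1)]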
This completes the index. Although the approach seems easy, in reality the process is impractical for large collections: as the size of the collection increases, the size of the frequency matrix also increases. For example, consider that the text of the Bible has to be inverted. The collection Bible is the King James Version of the Bible, with each verse treated as a document identified by its verse number. The Bible contains 31,101 documents and 8,965 distinct words, giving a matrix of roughly 279 million entries; holding it in memory requires more than 1 Gigabyte if a four-byte integer is allowed for each frequency. For the TREC (Text Retrieval Conference) collection, the matrix becomes far larger still.
Supposing that one byte is sufficient to record each within-document frequency fd,t (for TREC it is not adequate) does not help either: the space requirements for the two collections become 250 Mbytes and 350 Gbytes respectively, and the algorithm is still not viable. Nor does it help to keep the matrix as a Boolean array in virtual memory and let the operating system page the array into and out of memory as required; there will be roughly one page fault for each pointer access.

Virtual address spaces are provided to each process [16]. Each process has one page table, and during execution it is completely loaded into the main memory. Some page tables cannot be fully held in main memory because their processes are very large. For example, consider a 32-bit virtual address with 4 Kbyte pages: the offset part of the virtual address is 12 bits in size, which leaves 20 bits as the selector of the page, and a table with 2²⁰ entries per process is not practical. If each page table entry requires 4 bytes, then a page table with 2²⁰ entries requires 4 Mbytes. A page fault occurs when a page is not in main memory, and the page must then be fetched from disk before execution can continue.
In the TREC Collection

To read the entire text, parse it, and filter it through the dictionary takes 5 hours. During this time, the temporary file of triples is written; this takes a half hour. The temporary file is then sorted in runs of 48 Mbytes in memory; total internal sorting is 3 hours, during which the entire file is read and written, time that is allowed to cover reading and writing. Merging the sorted runs of the temporary file takes 13 hours. Finally, the temporary file is again read, and the inverted file is written to disk; this takes 1 hour. So the complete inversion takes about 20 hours.
Algorithm (sort-based inversion)

    1. /* Initialization */
       Create an empty dictionary structure S. Create an empty temporary file on disk.
    2. /* Process the text */
       For each document d, and for each term t with within-document frequency fd,t,
       append the triple <t, d, fd,t> to the temporary file.
    3. /* Internal sorting */
       Sort the temporary file in runs of k triples each: read a run into memory,
       sort it into non-decreasing t order and, within each t, non-decreasing d order,
       and write it back.
    4. /* Merging */
       Pairwise merge runs in the temporary file until it is one sorted run.
    5. /* Output the inverted file */
       For each term t:
       a. Start a new inverted file entry.
       b. Read all triples <t, d, fd,t> from the temporary file for t.
       c. Append the postings <d, fd,t> to the entry and write it to the inverted file.
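A Python sketch of the sort-based approach; an in-memory sort stands in for the run-sorting and pairwise merging of the on-disk temporary file, but it produces the same final order of triples.

    from collections import defaultdict
    from itertools import groupby

    docs = {1: "pease porridge hot pease porridge cold", 2: "pease porridge in the pot",
            3: "nine days old", 4: "some like it hot some like it cold",
            5: "some like it in the pot", 6: "nine days old"}

    # Assign term numbers in order of first appearance, as in the table below.
    term_number, triples = {}, []
    for d, text in sorted(docs.items()):
        counts = defaultdict(int)
        for word in text.split():
            term_number.setdefault(word, len(term_number) + 1)
            counts[word] += 1
        triples.extend((term_number[t], d, fdt) for t, fdt in counts.items())

    # Sort triples by (t, d); a real implementation sorts k-record runs on disk
    # and merges them pairwise, but the resulting order is the same.
    triples.sort()

    # Read off the inverted file: one entry per term number.
    for t, group in groupby(triples, key=lambda x: x[0]):
        print(t, [(d, fdt) for _, d, fdt in group])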
The sort-based algorithm is still not efficient enough for very large collections [15]. For TREC, the temporary file of triples requires a total of 8 Gbytes of disk space at the peak of the process. This amounts to more than 20 times the size of the index that is eventually produced, and is 60 percent larger than the text being inverted. Of course, the text being inverted is probably stored compressed, in which case the temporary disk space required is more than twice the space required by the text. We can therefore say that sort-based inversion is suitable only for moderate collections of documents. The table below shows the term numbers assigned, in order of first appearance, to the terms of the example collection.
Term Term Number
cold 4
days 9
hot 3
in 5
it 13
like 12
nine 8
old 10
pease 1
porridge 2
pot 7
some 11
the 6
The figure below shows the triples <t, d, fd,t> at the start and end of sort-based inversion: first as generated in document order, then, after in-memory sorting of runs and pairwise merging, as a single run sorted by term number and, within each term, by document number.

    Initial (document order):
    <1,1,2> <2,1,2> <3,1,1> <4,1,1> <1,2,1> <2,2,1> <5,2,1> <6,2,1> <7,2,1>
    <8,3,1> <9,3,1> <10,3,1> <11,4,2> <12,4,2> <13,4,2> <3,4,1> <4,4,1>
    <11,5,1> <12,5,1> <13,5,1> <5,5,1> <6,5,1> <7,5,1> <8,6,1> <9,6,1> <10,6,1>

    Final (sorted and merged):
    <1,1,2> <1,2,1> <2,1,2> <2,2,1> <3,1,1> <3,4,1> <4,1,1> <4,4,1> <5,2,1>
    <5,5,1> <6,2,1> <6,5,1> <7,2,1> <7,5,1> <8,3,1> <8,6,1> <9,3,1> <9,6,1>
    <10,3,1> <10,6,1> <11,4,2> <11,5,1> <12,4,2> <12,5,1> <13,4,2> <13,5,1>
CHAPTER 4
DYNAMIC INDEXING
The indexing techniques discussed so far were static. Most indexing techniques are 'static' as they are built in two phases: the document files are read during the first phase, and in the next phase these temporary internal files are optimized to prepare for retrieval. Once the optimization is finished the indices are static, so it is not possible to add new documents without rebuilding the whole index. Also, queries cannot see new documents until this re-indexing is performed.
In practice we need to index document collections which are dynamic. Static indexing can be used for document collections which do not change and remain the same, and we find such collections only in rare cases. Each time a query is processed, the indexes present in the index files are checked without the collection being re-indexed. In one scheme [19], the postings of the words are stored in an index file which is organized into a set of fixed-length blocks. These blocks pack the postings for many of the words together, with more or less free space remaining. The block numbers for each postings list are stored in an address record table. A free block list is kept for the blocks which contain a sufficient amount of free space, and the postings are allocated blocks of the index file as they arrive. The index file is allocated with a predetermined initial block size, and the blocks are subdivided level by level: at each level, a block is divided into n blocks of equal size [19]. The size of the initial block is the sum of the sizes of the blocks in each of the successive levels.
Dynamic indexing targets collections that are frequently added to and updated, i.e. which change frequently. Whenever new documents are added to the database, the collection of indexes becomes large and it takes time for the index file to get updated. Blocks of the index file are allocated to the postings for words that are contained in the index file by an information retrieval interface. When the postings for a word in the first block of the index file are updated to consist of some additional postings, the word may no longer fit in its block.
(Figure: layout of an index file block, holding a block table, several records, and free space.)
Each block holds the postings for one or more words. The free block list contains information which indicates whether or not a block is free. When a word's postings outgrow their block, the postings for the word are moved from the first block to a second, larger block by the information retrieval interface.
Each block contains some records and some free space. The number of records stored in the block and, for each record stored, the record number and an address within the block for the record are listed by the block address table [19]. The records themselves are packed at the other end of the block from the block table, and there is some free space between the block table and the records. Kept in main memory is a record address table that lists, for each record number, the block number currently containing that record. Also in memory is a free list that describes blocks that currently have an amount of free space greater than a given tolerance. Finally, the current last block of the file is kept in main memory rather than on disk.

The record address table is used to find the correct block number to access a record with a given ordinal record number. After the whole block is read into memory, the block address table is searched for the record number. This yields the address of the record within the block, so the record can then be located and the contents used. Now consider the case in which a record must be extended. If the block contains sufficient free space, the records are linearly shifted in the block to make the correct space, the extension added to the record, the block table updated, and the altered block written. This may be done with a single read and a single write of the block.

If there is insufficient free space within the original block, then the smallest record can be deleted whose removal leaves enough space for the extended record in the block. If there is no such record, then the record being extended is itself removed. Again, the block table and free list should be updated, and the block must be written back to the disk. In this case, however, still extant is a record that has no block, i.e. either a record that was removed to make space for the extension or the newly extended record. The free list is consulted to see whether there is any block in the file that has space for it. If there is, that block is retrieved from disk; the record is inserted; the block table, record address table, and free list are all updated; and the block is written back to disk. If there is not, i.e. if all the blocks on the free list are sufficiently full that they cannot hold the record, the in-memory last block of the file is checked: if it can hold the record, the record is inserted and the various tables are updated. If it cannot, the last block is written to disk and perhaps added to the free list, and a new, completely empty, last block is created in memory. Finally, the record can be inserted into this empty block.
In most cases, a record extension can be carried out with one block read and one block write. The worst that can be required is four disk operations: a read and a write for the block from which a record has been removed, and a read and a write for the block that does have enough space to receive it. The probability that the simple case applies is sufficiently high that in practice the average cost stays close to the two-operation minimum.
A dynamic collection changes as new information is being discovered and added. For such collections, the postings lists and the dictionary should be updated whenever there are any changes made to the collection. These updates could be handled by rebuilding the index from the beginning; this can easily be done if the modifications are small and the collection is not updated often. When new documents are inserted into the database, the postings of all the words in these documents are added to the index with an append operation. Also, an 'edit' operation is important, using which existing documents can be modified. A common solution maintains two indexes: a small auxiliary index for storing new documents, which is kept in main memory, and a second, large main index stored on disk. A query triggers a search process in both indexes and the final results are merged. Deletions are recorded in an invalidation bit vector for deleted documents, and the final search results are displayed after removing the invalidated documents.
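A minimal sketch of this two-index scheme; the index contents and the deleted-document set below are made up for illustration.

    main_index = {"box": [1, 2], "pen": [2]}   # large index, normally on disk (illustrative)
    aux_index = {"box": [3], "duck": [3]}      # small in-memory index for new documents
    invalidated = {1}                          # invalidation bit vector of deleted documents

    def search(term):
        """Query both indexes, merge the postings, and drop invalidated documents."""
        hits = main_index.get(term, []) + aux_index.get(term, [])
        return [d for d in sorted(set(hits)) if d not in invalidated]

    print(search("box"))  # [2, 3]: doc 1 was deleted, doc 3 came from the auxiliary index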
The auxiliary index is merged into the main index whenever it becomes too large, and the cost of merging depends upon the storage of the index in the file system. If each postings list is stored as a separate file, the merge includes only extending each postings list of the main index with its corresponding postings list from the auxiliary index. The auxiliary index is mainly used to reduce the number of disk seeks that are necessary over time: with an auxiliary index, only a bounded number of disk seeks is required to update each postings list when the main index and auxiliary index are being merged. However, a large number of files cannot be handled by most file systems, so in practice the postings lists cannot each live in their own file and merging is more expensive.
    LMERGEADDTOKEN(indexes, Z0, token)
      Z0 ← MERGE(Z0, {token})
      if |Z0| = n
        then for i ← 0 to ∞
          do if Ii ∈ indexes
            then Zi+1 ← MERGE(Ii, Zi)   (Zi+1 is a temporary index on disk)
                 indexes ← indexes − {Ii}
            else Ii ← Zi   (Zi becomes the permanent index Ii)
                 indexes ← indexes ∪ {Ii}
                 BREAK
          Z0 ← ∅

    LOGARITHMICMERGE()
      Z0 ← ∅   (Z0 is the in-memory index)
      indexes ← ∅
      while true
        do LMERGEADDTOKEN(indexes, Z0, GETNEXTTOKEN())
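A runnable Python sketch of the pseudocode above, with a sorted list of postings standing in for an index and a dictionary mapping each level i to the index Ii:

    def merge(a, b):
        """Merge two indexes; here an index is just a sorted list of postings."""
        return sorted(a + b)

    def l_merge_add_token(indexes, z0, token, n=4):
        z0 = merge(z0, [token])
        if len(z0) == n:                  # in-memory index Z0 is full
            z, i = z0, 0
            while i in indexes:           # cascade: merge equal-sized indexes upward
                z = merge(indexes.pop(i), z)
                i += 1
            indexes[i] = z                # Z_i becomes the permanent index I_i
            z0 = []
        return z0

    indexes, z0 = {}, []
    for token in range(1, 20):            # a stream of 19 postings with n = 4
        z0 = l_merge_add_token(indexes, z0, token)
    # Levels present and their sizes n * 2^i; the last 3 postings remain in z0.
    print(sorted(indexes), [len(v) for _, v in sorted(indexes.items())], z0)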
In this algorithm each token is added to Z0, the in-memory index. By contrast, in the simple auxiliary-index scheme each posting is processed during each of the [T/n] merges, where n represents the size of the auxiliary index and T represents the total number of postings. Because the docIDs within each postings list are kept sorted, each merge is a linear scan, so we can say that Θ(T²/n) gives the overall time complexity of the auxiliary scheme. For this purpose, it can be said that a postings list is nothing but a sorted list of docIDs.
(Figure: logarithmic merging. The in-memory index Z0 holds up to 2⁰n postings; the on-disk indexes I0, I1, ... hold 2⁰n, 2¹n, ... postings, and Merge(I0, Z0) produces Z1 of size 2¹n.)

Logarithmic merging proceeds by maintaining log₂(T/n) indexes I0, I1, I2, I3, ... with sizes 2⁰ × n, 2¹ × n, 2² × n, 2³ × n, .... On each level the postings are processed only once, which is the source of the efficiency of logarithmic merging.
The in-memory index is called Z0. After reaching the limit n, a new index I0 is created on the disk and the 2⁰ × n postings in Z0 are transferred into this new index I0. If Z0 fills again while I0 exists, the two are merged into an index I1 of size 2¹ × n, and so on. A query is answered by servicing all currently valid indexes Ii on disk together with Z0 in memory, and merging the search results.
Each posting is processed only once on each of the log₂(T/n) levels, so index construction takes Θ(T log(T/n)) time overall. This efficiency gain is traded for a slowdown of query processing, since the query must now be answered from many indexes Ii instead of just the main and auxiliary indexes. In the auxiliary scheme, the very large indexes must be merged occasionally, and this slows down the search system during the merge; with logarithmic merging this costly process occurs less frequently.
Figure 4.5 below shows a block structure where each block represents a portion of memory for the index file [19]. Each level consists of blocks which are of equal sizes. The index file initially consists of a block with some predetermined size, and it is divided into various blocks of successively smaller sizes. The figure ranges from the high level to the low level.
(Figure 4.5: hierarchy of index file blocks. Block 101 occupies level n; it is split into blocks 103 and 105 at level n-1; lower levels split further, e.g. block 107 at level 1 into blocks 111 and 113 at level 0.)
Here the higher level is level n and the lower level is level 0. The higher level consists of a single block which is large in size, whereas the lower level consists of blocks with the minimum size. Here the block 101 is in the higher level, i.e. level n, and in level n-1 it is partitioned into two blocks 103 and 105 of equal sizes. Each higher block is partitioned into two blocks of equal sizes. In level 1 the block 107 is partitioned into two blocks 111 and 113 of equal sizes in the lower level, i.e. level 0. The blocks 103, 105, 111 and 113 are the child blocks, whereas the blocks 101 and 107 are the parent blocks. The size of a parent block is twice the size of its child blocks, as each parent is divided into two child blocks. The size of the block in the higher level n is 2ⁿ, the size of the block in the lower level 0 is 2⁰, i.e. 1, and the size of the block in level 1 is 2¹, i.e. 2. The parent block is thrice the size of the child block if it has 3 children; in that case the size of the block in the lower level 0 will be 1 and the size of the block in the higher level n will be 3ⁿ. When the index file is opened, the information is read from the secondary memory into the main memory, and when the index file is closed, the information kept in the main memory is written back to secondary memory.

The blocks are used to store the postings list for a word from the document collection. Consider the following example collection.
Now consider that the word 'box' is seen in several documents. It appears in doc1, doc3 and doc4, so the postings of 'box' are [doc1, doc3, doc4]. The index file is partitioned into blocks to store the postings of 'box'.

(Figure: index file allocation for the word 'box', with block 311 at level 4, block 309 at level 3, and the postings of 'box' held at level 2.)
The largest block 311 in level 4 is deleted from the free block list, and a new block 309 is created along with the lower-level blocks produced by the split; the postings of 'box' are allocated to a block in level 2. Figure 4.7 below shows the index file allocation for two words. Now consider the word 'duck', which is present in doc2 and doc3; the postings for 'duck' are [doc2, doc3]. The block 301 is partitioned into blocks 303 and 305 in level 1. The postings for 'duck' are allocated to block 303 in level 1. The block 301 is removed from the free block list.
(Figure 4.7: index file allocation for the words 'box' and 'duck', with blocks 307 and 301 at level 2 and the postings of 'duck' in block 303 at level 1.)
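A sketch of this allocation strategy under the two-children assumption: the free list holds block sizes only, and a free block is split repeatedly, freeing the equal-sized sibling each time, until the smallest block that still fits the postings remains. The function name and sizes are illustrative.

    def allocate(free_blocks, size_needed):
        """Split the smallest fitting free block down to the smallest size that fits.

        free_blocks is a list of block sizes. A parent block of size 2^k splits into
        two children of size 2^(k-1); one child is kept, the sibling is freed,
        mirroring levels n .. 0 in the figures above.
        """
        candidates = [b for b in free_blocks if b >= size_needed]
        if not candidates:
            raise MemoryError("no free block large enough")
        block = min(candidates)
        free_blocks.remove(block)
        while block // 2 >= size_needed:   # split: keep one child, free its sibling
            block //= 2
            free_blocks.append(block)      # the freed sibling has the same size
        return block, free_blocks

    # Index file starts as one level-4 block of size 16 (2^4).
    free = [16]
    box_block, free = allocate(free, 3)    # postings of 'box': [doc1, doc3, doc4]
    duck_block, free = allocate(free, 2)   # postings of 'duck': [doc2, doc3]
    print(box_block, duck_block, sorted(free))  # 4 2 [2, 8]: levels 2 and 1, as above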
Some operations become harder when there are multiple indexes. For example, the spelling correction algorithm, which selects the corrected alternative with the most hits, gets affected [14]: with multiple indexes and an invalidation bit vector, obtaining the correct number of hits for a term is no longer a simple lookup. In addition, aspects of the system such as index maintenance and distribution are more complex. Some of the large search engines therefore rebuild the index from scratch periodically; query processing is then switched to the new index and the old index is deleted.
CHAPTER 5
CONCLUSION

Document collections change over time, and the index must be kept consistent with them; this can be done using dynamic indexing, in which modifications made in the collection are reflected in the index as new documents are added to the database and old ones are deleted. Different operations can be used in maintaining the index, like insert, delete, and update. With dynamic indexing, documents become searchable shortly after they are indexed, which typically can take a small fraction of a second.
BIBLIOGRAPHY
12. Happy Codings, 'Tokenizing Java source code' (n.d.). http://www.java.happycodings.com/Core_Java/code84.html
21. Dik L. Lee, Huei Chuang, Kent Seamons, 'Document Ranking and the Vector-Space Model', IEEE Software, Volume 14, Issue 2, pp. 67-75, March/April 1997.
22. Kazem Taghva, Julie Borsack and Allen Condit, 'Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model', Information Science Research Institute, UNLV, Information Processing and Management, 32(3): 317-327, 1996.
VITA
Graduate College
University of Nevada, Las Vegas
Viswada Sripathi
Degrees:
Bachelor of Technology in Computer Science and Engineering, 2008
Jawaharlal Nehru Technological University, India