Inverted Index
Dr. Subrat Kumar Nayak
Associate Professor
Dept. of CSE, ITER, SOADU
Inverted Index
This is the first major concepts of information retrieval.
The name is actually redundant: an index always maps back from
terms to the parts of a document where they occur.
Inverted Index, sometimes coined as Inverted file.
Inverted Index is used keep a dictionary of terms. Then for each
term, we have a list that records which documents the term occurs
in.
Each item in the list, which records that a term appeared in a
document is conventionally called a posting.
The list is then called a postings list (or inverted list), and all the
postings lists taken together are referred to as the postings.
dictionary term is used for the data structure and vocabulary for the set
of terms
Inverted Index
For each term t, we must store a list of all documents that contain t.
Identify each doc by a docID, a document serial number
Can we used fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
What happens if the word Caesar is added
to document 14?
Inverted Index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion Posting
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Dictionary Postings
Sorted by docID (more later on why).
Inverted Index (Construction)
Documents to Friends, Romans, countrymen.
be indexed
Tokenizer
Token stream Friends Romans Countrymen
Linguistic modules
Modified tokens friend roman countryman
Indexer friend 2 4
roman 1 2
Inverted index
countryman 13 16
Initial stages of text processing
Tokenization
Cut character sequence into word tokens
Deal with “John’s”, a state-of-the-art solution
Normalization
Map text and query term to same form
You want U.S.A. and USA to match
Stemming
We may wish different forms of a root to match
authorize, authorization
Stop words
We may omit very common words (or not)
the, a, to, of
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1 Doc 2
I did enact Julius So let it be with
Caesar I was killed Caesar. The noble
i’ the Capitol; Brutus hath told you
Brutus killed me. Caesar was ambitious
Indexer steps: Sort
Sort by terms
At least conceptually
And then docID
Core indexing step
Indexer steps: Dictionary & Postings
Multiple term entries in a
single document are
merged.
Split into Dictionary and
Postings
Doc. frequency information
is added.
Why frequency?
Will discuss later.
Where do we pay in storage?
Lists of
docIDs
Terms
and
counts
IR system
implementation
• How do we index
efficiently?
• How much
storage do we
need?
Pointers 10
Inverted Index
Inverted index works much better than the Boolean retrieval method.
Sorting based inverted indexing is more efficient than the inverted indexing
method since least work needs to be done.
Query processing with an inverted index
How do we process a query? Our focus
Later – what kinds of queries can we process?
Query processing: AND
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the Dictionary;
Retrieve its postings.
Locate Caesar in the Dictionary;
Retrieve its postings.
“Merge” the two postings (intersect the document sets):
2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
The merge
Walk through the two postings simultaneously, in time linear
in the total number of postings entries
2 4 8 16 32 64 128 Brutus
1 2 3 5 8 13 21 34 Caesar
If the list lengths are x and y, the merge takes O(x+y)
operations.
Crucial: postings sorted by docID.
Intersecting two postings lists
(a “merge” algorithm)
The merge
Walk through the two postings simultaneously, in time linear
in the total number of postings entries
2 4 8 16 32 64 128 Brutus
2 8
1 2 3 5 8 13 21 34 Caesar
If the list lengths are x and y, the merge takes O(x+y)
operations.
Crucial: postings sorted by docID.
Boolean queries: Exact match
The Boolean retrieval model is being able to ask a query that is a
Boolean expression:
Boolean Queries are queries using AND, OR and NOT to join query terms
Views each document as a set of words
Is precise: document matches condition or not.
Perhaps the simplest model to build an IR system on
Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean:
Email, library catalog, macOS Spotlight
Query optimization
Query optimization is the process of selecting how to organize the work of
answering a query so that the least total amount of work needs to be done by the
system.
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
Query: Brutus AND Calpurnia AND Caesar
Query optimization example
Process in order of increasing freq:
start with smallest set, then keep cutting further.
This is why we kept
document freq. in dictionary
Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 16 21 34
Calpurnia 13 16
Execute the query as (Calpurnia AND Brutus) AND Caesar.
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
Get doc. freq.’s for all terms.
Estimate the size of each OR by the sum of its doc. freq.’s
(conservative).
Process in increasing order of OR sizes.
Algorithm to Intersect n terms.