Search Engines
Information Retrieval in Practice
All slides ©Addison Wesley, 2008
With changes by Crista Lopes
Simple Inverted
Index
Index Construction
• Simple in-memory indexer
List<Posting>()
[Link](Posting(n))
Write to a file
Architecture
Index Creation Index Ranking
Log
Querying Process
Preprocessing Steps
Text Transformation Evaluation
Local
Text Acquisition Document UI
Store
Web Pages
Query Processing
• Document-at-a-time
– Calculates complete scores for documents by
processing all term lists, one document at a time
• Term-at-a-time
– Accumulates scores for documents by processing
term lists one at a time
• Both approaches have optimization
techniques that significantly reduce time
required to generate scores
Document-At-A-Time
Pseudocode Function Descriptions
• getCurrentDocument()
– Returns the document number of the current posting of the inverted
list.
• skipForwardToDocument(d)
– Moves forward in the inverted list until getCurrentDocument() <= d.
This function may read to the end of the list.
• movePastDocument(d)
– Moves forward in the inverted list until getCurrentDocument() < d.
• moveToNextDocument()
– Moves to the next document in the list. Equivalent to
movePastDocument(getCurrentDocument()).
• getNextAccumulator(d)
– returns the first document number d' >= d that has already has an
accumulator.
• removeAccumulatorsBetween(a, b)
– Removes all accumulators for documents numbers between a and b.
Ad will be removed iff a < d < b.
Document-At-A-Time
Term-At-A-Time
Term-At-A-Time
Optimization Techniques
• Term-at-a-time uses more memory for
accumulators, but accesses disk more
efficiently
• Two classes of optimization
– Read less data from inverted lists
• e.g., skip lists
• better for simple feature functions
– Calculate scores for fewer documents
• e.g., conjunctive processing
• better for complex feature functions
Conjunctive
Term-at-a-Time
Conjunctive
Document-at-a-Time
Threshold Methods
• Threshold methods use number of top-ranked
documents needed (k) to optimize query
processing
– for most applications, k is small
• For any query, there is a minimum score that each
document needs to reach before it can be shown
to the user
– score of the kth-highest scoring document
– gives threshold τ
– optimization methods estimate τ′ to ignore
documents
Threshold Methods
• For document-at-a-time processing, use score
of lowest-ranked document so far for τ′
– for term-at-a-time, have to use kth-largest score in
the accumulator table
• MaxScore method compares the maximum
score that remaining documents could have to
τ′
– safe optimization in that ranking will be the same
without optimization
MaxScore Example
• Indexer computes μtree
– maximum score for any document containing just “tree”
• Assume k =3, τ′ is lowest score after first three docs
• Likely that τ ′ > μtree
– τ ′ is the score of a document that contains both query
terms
• Can safely skip over all gray postings
Other Approaches
• Early termination of query processing
– ignore high-frequency word lists in term-at-a-time
– ignore documents at end of lists in doc-at-a-time
– unsafe optimization
• List ordering
– order inverted lists by quality metric (e.g.,
PageRank) or by partial score
– makes unsafe (and fast) optimizations more likely
to produce good documents