Advanced Indexing Issues
Additional Indexing Issues
• Indexing for queries about the Web structure
• Indexing for queries containing wildcards
• Preprocessing of text
• Spelling correction
Indexing the Web Structure
Connectivity Server
• Support for fast queries on the web graph
– Which URLs point to a given URL?
– Which URLs does a given URL point to?
• Applications
– Crawl control
– Web graph analysis (Connectivity, crawl optimization)
– Link analysis
WebGraph
• The WebGraph Framework I: Compression Techniques. Boldi and Vigna, WWW 2004
• Goal: maintain node adjacency lists in memory
– For this, compressing the adjacency lists is the critical component
Adjacency lists: Naïve Solution
• The set of neighbors of a node
• Assume each URL is represented by an integer
– for a 118 million page web, we need 27 bits per node
• Naïvely, this demands 54 bits to represent each hyperlink
– When stored in an adjacency list, we cut the size in half. Why?
The method we will see achieves: 118M nodes, 1G links, with an average of 3 bits per link!
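As a quick check of these figures, here is a throwaway Python computation (illustrative only, not part of the original material):

```python
import math

nodes = 118_000_000
bits_per_node = math.ceil(math.log2(nodes))  # 27 bits suffice for a distinct ID per page
print(bits_per_node, 2 * bits_per_node)      # 27 54 -> 54 bits per (source, target) pair
```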
Observed Properties of Links
• Locality: Most links lead the user to a page on the same host, so if URLs are stored in lexicographic order, the indices of the source and target are close
• Similarity: URLs that are lexicographically close have many common successors
• Consecutivity: Often many successors of a page have consecutive URLs
Naïve Representation
• Store, for each node, its out-degree followed by its successor list
• Why do we need the out-degree?
Gap Compression
Successor list of x: S(x) = {s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k−1} − 1}
For the first entry v = s_1 − x (which may be negative), we actually store:
– 2v if v ≥ 0
– 2|v| − 1 if v < 0
Why?
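A minimal sketch of this encoding in Python (the helper name gap_encode is mine; the answer to the "why" is that the variable-length codes used downstream expect non-negative integers):

```python
def gap_encode(x, successors):
    """Sketch: gap-encode the sorted successor list S(x) as described above."""
    if not successors:
        return []
    v = successors[0] - x                          # the first gap may be negative
    first = 2 * v if v >= 0 else 2 * abs(v) - 1    # fold the sign into a non-negative int
    rest = [b - a - 1 for a, b in zip(successors, successors[1:])]
    return [first] + rest

# gap_encode(15, [13, 15, 16, 17, 18]) -> [3, 1, 0, 0, 0]
```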
Reference Compression
• Instead of representing S(x), we can code it as a modified version of S(y), for some y < x
• x − y is the reference number
• The copy list is a bit sequence of the length of S(y) which indicates which of the values in S(y) are also in S(x)
• Also store a list of extra nodes, S(x) − S(y)
• Usually we only consider references within a sliding window
Reference Compression
A reference number of 0 indicates that no referencing is used, i.e., there is no copy list
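A minimal sketch of reference coding (helper names are mine; the real WebGraph format interleaves these fields in a compressed bit stream):

```python
def reference_encode(s_x, s_y, ref):
    """Sketch: code S(x) against the reference list S(y), where ref = x - y."""
    in_x = set(s_x)
    copy_list = [1 if v in in_x else 0 for v in s_y]   # one bit per entry of S(y)
    extras = sorted(set(s_x) - set(s_y))               # successors of x missing from S(y)
    return ref, copy_list, extras

def reference_decode(s_y, copy_list, extras):
    copied = [v for v, bit in zip(s_y, copy_list) if bit]
    return sorted(copied + extras)
```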
Differential Compression
• Do not store a copy list that is the length of S(y)
• Look at the copy list as an alternating sequence of 1-blocks and 0-blocks
• Store the number of blocks
• Store the length of the first block
• Store the lengths − 1 of all other blocks besides the last block (why?)
• Always assume that the first block is a block of 1s (it may be a block of length 0)
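A sketch of the block encoding (my own helper, following the rules above; the answer to the "why" is that every block other than the first and last must be non-empty, so its length can be stored shifted down by one):

```python
def to_copy_blocks(copy_list):
    """Sketch: turn a 0/1 copy list into the block representation described above."""
    blocks = []
    cur_bit, run = 1, 0                    # by convention the first block counts 1s
    for bit in copy_list:
        if bit == cur_bit:
            run += 1
        else:
            blocks.append(run)             # a leading 0-run yields a first block of length 0
            cur_bit, run = bit, 1
    blocks.append(run)
    if len(blocks) == 1:                   # the last block is never stored: its length
        stored = []                        # follows from the length of S(y)
    else:
        stored = [blocks[0]] + [b - 1 for b in blocks[1:-1]]
    return len(blocks), stored

# to_copy_blocks([1, 1, 0, 0, 0, 1]) -> (3, [2, 2])
```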
Using copy blocks
Compressing Extra Nodes List
• Use consecutivity to compress the list of extra nodes
– We find subsequences of consecutive numbers whose length is ≥ a given threshold Lmin
• Store a list of integer intervals: for each interval we store its left extreme and its length
– Left extremes are compressed using the differences between consecutive left extremes, minus 2
– Interval lengths are decremented by Lmin
• Store a list of residuals, compressed using gaps (and a variable-length encoding)
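A sketch of the interval extraction (the helper name and the lmin default are mine; the delta coding of left extremes and the gap coding of residuals are omitted for brevity):

```python
def split_intervals(extras, lmin=2):
    """Sketch: pull maximal runs of consecutive ids of length >= lmin out of the
    sorted extra-node list; whatever remains becomes the residual list."""
    intervals, residuals = [], []
    i = 0
    while i < len(extras):
        j = i
        while j + 1 < len(extras) and extras[j + 1] == extras[j] + 1:
            j += 1
        run = extras[i:j + 1]
        if len(run) >= lmin:
            intervals.append((run[0], len(run)))   # (left extreme, length)
        else:
            residuals.extend(run)
        i = j + 1
    return intervals, residuals

# split_intervals([5, 6, 7, 12, 14, 15]) -> ([(5, 3), (14, 2)], [12])
```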
Compressing intervals
Important property: this series of numbers is self-decodable
Note: the node number is not actually stored
Indexing for Wildcard Queries
Finding Lexicon Entries
• Suppose that we are given a term t; we can find its entry in the lexicon using binary search
• What happens if we are given a term with a wildcard? How can we find the following terms in the lexicon?
– lab*
– *or
– lab*r
Indexing Using n-grams
• Decompose all terms into n-grams for some small value of n
– n-grams are sequences of n letters in a word
– use $ to mark the beginning and end of a word
– digram = 2-gram
• Example: Digrams of labor are $l, la, ab, bo, or, r$
• We store an additional structure that, for each digram, holds a list of the terms containing that digram
– actually, store a list of pointers to entries in the lexicon
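A one-liner makes the decomposition concrete (illustrative Python):

```python
def digrams(term):
    """Digrams of a term, with $ marking its beginning and end (as above)."""
    padded = "$" + term + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

print(digrams("labor"))  # ['$l', 'la', 'ab', 'bo', 'or', 'r$']
```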
Example

Lexicon (sorted alphabetically):

Number  Term
1       abhor
2       bear
3       laaber
4       labor
5       laboratory
6       labour
7       lavacaber
8       slab

Digram  Term numbers
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      5
r$      (can you fill this in?)
sl      (can you fill this in?)
To find lab*r, you would look for terms common to the lists for $l, la, ab, and r$, and then post-process to ensure that each candidate term actually matches the pattern.
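A sketch of this lookup for a pattern with a single wildcard (the function and index layout are illustrative, not from the original slides):

```python
def wildcard_candidates(pattern, digram_index):
    """Sketch: candidate term numbers for a pattern with one '*', e.g. 'lab*r'.
    digram_index maps a digram to the set of term numbers containing it,
    as in the table above."""
    head, tail = pattern.split("*")
    padded_head = "$" + head                # the pattern's start is a real word start
    padded_tail = tail + "$"                # the pattern's end is a real word end
    grams = [padded_head[i:i + 2] for i in range(len(padded_head) - 1)]
    grams += [padded_tail[i:i + 2] for i in range(len(padded_tail) - 1)]
    sets = [digram_index[g] for g in grams if g in digram_index]
    return set.intersection(*sets) if sets else set()
```

The post-processing step is essential: for lab*r, laaber (term 3) contains $l, la, ab, and r$ and so survives the intersection, yet it does not match the pattern.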
Indexing Using Rotated Lexicons
• If wildcard queries are common, we can save time at the cost of more space
• A rotated index can find the matches of any query with a single wildcard in one binary search
• An index entry is stored for each letter of each term
– labor would have 6 pointers: one for each letter + one for the beginning of the word
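A sketch of how the rotated entries are generated (the helper is mine, padding with $ as above):

```python
def rotations(term, number):
    """Sketch: all rotated forms of a term with their (term number, offset) addresses."""
    padded = "$" + term
    return [(padded[i:] + padded[:i], (number, i)) for i in range(len(padded))]

# rotations("labor", 4) -> $labor (4,0), labor$ (4,1), abor$l (4,2), ..., r$labo (4,5)
```

A query with one wildcard then becomes a prefix search over the sorted rotated forms: lab* searches for the prefix $lab, *or for or$, and l*r for r$l.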
Partial Example

Number  Term
1       abhor
2       bear
4       labor
6       labour

Rotated Form  Address
$abhor        (1,0)
$bear         (2,0)
$labor        (4,0)
$labour       (6,0)
abhor$        (1,1)
abor$l        (4,2)
abour$l       (6,2)
r$abho        (1,5)
r$bea         (2,4)
r$labo        (4,5)
r$labou       (6,6)

Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers is enough for binary search.
Partial Example (cont)

Using the rotated lexicon above, how would you find the terms for:
– lab*
– *or
– *ab*
– l*r
– l*b*r
Summary
• We now know how to find the terms that match a wildcard query
• Basic steps for evaluating a wildcard query:
– look up the wildcarded words in an auxiliary index to find all possible matching terms
– given these terms, proceed with normal query processing (as if this were not a wildcard query)
• But we have not yet explained how normal query processing should proceed! This is the next topic.
Homework
• We would like to support retrieval of documents
containing phrases
– E.g., given “information retrieval”, return all documents
containing “information” and then “retrieval” immediately
afterwards
• Proposed Solution: Biword index.
– Should efficiently support the following: given two wordIds w1, w2, efficiently return all documents containing w1 followed immediately by w2
Homework (cont)
1. Suggest a data structure for storing the biword index.
Describe the structure precisely.
2. Draw the structure assuming that the following
documents are given:
– great success on the exam
– Ido passed with great success
– easy on the exam easy
3. Analyze the search complexity of the structure.
4. Analyze the size complexity of the structure.
Preprocessing the Data
Choosing What Data To Store
• We would like the user to be able to get as many relevant answers to their query as possible
• Examples:
– Query: computer science. Should it match Computer
Science?
– Query: data compression. Should it match compressing
data?
– Query: Amir Perez. Should it match Amir Peretz?
• The way we store the data in our lexicon will affect
our answers to the queries
Case Folding
• It is normally accepted to perform case folding, i.e., to reduce all words to lower-case form before storing them in the lexicon
• The user's query is transformed to lower case before looking up the terms in the index
• What effect does this have on the lexicon size?
Stemming
• Suppose that a user is interested in finding pages
about “running shoes”
– We may also want to return pages with shoe
– We may also want to return pages with run or runs
• Solution: Use a stemmer
– The stemmer returns the stem (root) of the word
• Note: This means that more relevant answers will
be returned, as well as more irrelevant answers!
– Example: cleary AND witten => clear AND wit
Porter Stemmer
• A multi-step, longest-match stemmer.
– The paper introducing this stemmer can be found online
• Notation
– v: vowel(s)
– c: consonant(s)
– (vc)^m: vowel(s) followed by consonant(s), repeated m times
• Any word can be written: [c](vc)^m[v]
– brackets are optional
– m is called the measure of the word
• We discuss only the first few rules of the stemmer
Porter Stemmer: Step 1a
Follow the first applicable rule

Suffix  Replacement  Examples
sses    ss           caresses => caress
ies     i            ponies => poni, ties => ti
ss      ss           caress => caress
s       (null)       cats => cat
Porter Stemmer: Step 1b
Follow the first applicable rule

Conditions  Suffix  Replacement  Examples
(m > 0)     eed     ee           feed -> feed, agreed -> agree
(*v*)       ed      (null)       plastered -> plaster, bled -> bled
(*v*)       ing     (null)       motoring -> motor, sing -> sing

*v* – the stem contains a vowel
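A minimal sketch of these two steps (my own code, treating only a, e, i, o, u as vowels and ignoring Porter's special handling of y):

```python
import re

def step_1a(word):
    """Apply the first matching Step 1a rule."""
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

def step_1b(word):
    """Apply the first matching Step 1b rule (only the three rules shown above)."""
    def measure(stem):                               # m in [c](vc)^m[v]
        return len(re.findall(r"[aeiou]+[^aeiou]+", stem))
    def has_vowel(stem):
        return re.search(r"[aeiou]", stem) is not None
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix) and has_vowel(word[: -len(suffix)]):
            return word[: -len(suffix)]
    return word

print(step_1a("ponies"), step_1b("agreed"), step_1b("motoring"))  # poni agree motor
```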
Stop Words
• Stop words are very common words that generally
are not of importance, e.g.: the, a, to
• Such words take up a lot of room in the index
(why?)
• They slow down query processing (why?)
• They generally do not improve the results (why?)
• Some search engines do not store these words at
all, and remove them from queries
– Is this always a good idea?
Spelling correction
Spell correction
• Two principal uses:
1. Correcting document(s) being indexed
2. Retrieve matching documents when query contains
a spelling error
• Types of spelling correction:
– Isolated word spelling correction
– Context-sensitive spelling correction
– Only the latter will catch: I flew form Heathrow to Narita.
Document correction
• Primarily for OCR’ed documents
– Correction algorithms tuned for this
• Goal: the index (dictionary) contains fewer OCR-induced misspellings
• Can use domain-specific knowledge
– E.g., OCR can confuse O and D more often than it confuses O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely interchanged in typing than in OCR).
Query mis-spellings
• Our principal focus here
– E.g., the query carot
• We can either
– Retrieve documents indexed by the correct
spelling, OR
– Return several suggested alternative queries with
the correct spelling
• Did you mean … ?
Isolated word correction
• Fundamental premise – there is a lexicon
from which the correct spellings come
• Two basic choices for this
– A standard lexicon (e.g., Webster's English Dictionary, an industry-specific lexicon)
– The lexicon of the indexed corpus (e.g., all
words on the web, including the mis-spellings)
Isolated word correction
• Problem: Given a lexicon and a character
sequence Q, return the words in the lexicon
closest to Q
• What’s “closest”?
• We’ll study several alternatives
– Edit distance
– Weighted edit distance
– n-gram overlap
Edit distance
• Edit Distance: Given two strings S1 and S2, the
minimum number of basic operations to convert one
to the other
• Basic operations are typically character-level
– Insert, Delete, Replace
• The edit distance from cat to dog is 3.
– What is the edit distance from cat to dot?
– What is the maximal edit distance between s and t?
• Generally found by dynamic programming.
Computing Edit Distance: Intuition
• Suppose we want to compute the edit distance of s and t
• Create a matrix d with columns 0,…,|s| and rows 0,…,|t|
• The entry d[i,j] is the edit distance between the words:
– s_i (i.e., the prefix of s of size i)
– t_j (i.e., the prefix of t of size j)
• Where will the edit distance of s and t be placed?
Computing Edit Distance (1)
To compute the edit distance of s and t:
1. n := length(s)
m := length(t)
If n = 0, return m and exit.
If m = 0, return n and exit.
2. Construct a matrix containing 0..m rows and 0..n
columns.
Initialize the first row to 0..n.
Initialize the first column to 0..m.
Computing Edit Distance (2)
3. for each character of s (i from 1 to n).
4. for each character of t (j from 1 to m).
5. If s[i] equals t[j], c := 0.
If s[i] doesn't equal t[j], c := 1.
6. d[i,j] := min(d[i-1,j]+1,
d[i,j-1]+1,
d[i-1,j-1]+c).
7. Return d[n,m].
Example: s = GUMBO, t = GAMBOL
• Steps 1 and 2:

      G  U  M  B  O
   0  1  2  3  4  5
G  1
A  2
M  3
B  4
O  5
L  6

Continue on blackboard!
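The recurrence is short enough to run as-is; a direct Python rendering of the steps above (finishing the GUMBO/GAMBOL example):

```python
def edit_distance(s, t):
    """The dynamic program above as runnable Python."""
    n, m = len(s), len(t)
    # d[i][j] = edit distance between the first i characters of s and the first j of t
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete from s
                          d[i][j - 1] + 1,       # insert into s
                          d[i - 1][j - 1] + c)   # replace (or match)
    return d[n][m]

print(edit_distance("GUMBO", "GAMBOL"))  # 2
```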
Weighted edit distance
• As above, but the weight of an operation depends
on the character(s) involved
– Meant to capture keyboard errors, e.g. m more likely to be
mis-typed as n than as q
– Therefore, replacing m by n is a smaller edit distance than
by q
– (Same ideas usable for OCR, but with different weights)
• Require weight matrix as input
• Modify dynamic programming to handle weights
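One possible modification (a sketch; sub_cost is a hypothetical weight table, with unlisted pairs defaulting to cost 1):

```python
def weighted_edit_distance(s, t, sub_cost, indel_cost=1.0):
    """Sketch: the same DP with a per-pair substitution cost."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * indel_cost
    for j in range(m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub_cost.get((s[i - 1], t[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + indel_cost,
                          d[i][j - 1] + indel_cost,
                          d[i - 1][j - 1] + c)
    return d[n][m]

# weighted_edit_distance("cam", "can", {("m", "n"): 0.5}) -> 0.5
```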
Edit distance to all dictionary terms?
• Given a (mis-spelled) query – do we
compute its edit distance to every dictionary
term?
– Expensive and slow
• How do we cut the set of candidate
dictionary terms?
• Here we use n-gram overlap for this
n-gram overlap
• Enumerate all the n-grams in the query
string as well as in the lexicon
• Use the n-gram index (recall wild-card
search) to retrieve all lexicon terms matching
any of the query n-grams
• Threshold by number of matching n-grams
Example with trigrams
• Suppose the text is november
– Trigrams are nov, ove, vem, emb, mbe, ber.
• The query is december
– Trigrams are dec, ece, cem, emb, mbe, ber.
• So 3 trigrams overlap (of 6 in each term)
• How can we turn this into a normalized
measure of overlap?
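The overlap count is a two-liner (illustrative Python):

```python
def trigrams(word):
    """The set of trigrams of a word (no $ padding, as in the example above)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

print(trigrams("november") & trigrams("december"))  # {'emb', 'mbe', 'ber'}: 3 of 6 overlap
```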
One option – Jaccard coefficient
• A commonly-used measure of overlap
• Let X and Y be two sets; then the J.C. is
|X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and
zero when they are disjoint
• X and Y don’t have to be of the same size
• Always assigns a number between 0 and 1
– Now threshold to decide if you have a match
– E.g., if J.C. > 0.8, declare a match
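Applied to the november/december example (a sketch; the 0.8 threshold is the one suggested above):

```python
def jaccard(x, y):
    """Jaccard coefficient |X intersect Y| / |X union Y| of two sets."""
    return len(x & y) / len(x | y)

nov = {"nov", "ove", "vem", "emb", "mbe", "ber"}   # trigrams of november
dec = {"dec", "ece", "cem", "emb", "mbe", "ber"}   # trigrams of december
print(jaccard(nov, dec))   # 3/9 = 0.33 -> below 0.8, so no match
```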
Matching trigrams

• Can we compute the Jaccard coefficient if words are not stored in the n-gram index?
• Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lord, sloth
or → border, lord, morbid
rd → ardent, border, card

Standard postings "merge" will enumerate …
Adapt this to using Jaccard (or another) measure.
Computational cost
• Spell-correction is computationally
expensive
• Avoid running routinely on every query?
– Run only on queries that matched few docs