
Advanced Indexing Issues

1
Additional Indexing Issues

• Indexing for queries about the Web structure


• Indexing for queries containing wildcards
• Preprocessing of text
• Spelling correction

2
Indexing the Web Structure

3
Connectivity Server

• Support for fast queries on the web graph


– Which URLs point to a given URL?
– Which URLs does a given URL point to?

• Applications
– Crawl control
– Web graph analysis (Connectivity, crawl optimization)
– Link analysis

4
WebGraph

• The WebGraph Framework I: Compression Techniques. Boldi and Vigna, WWW 2004
• Goal: maintain node adjacency lists in memory
– For this, compressing the adjacency lists is the critical component

5
Adjacency lists: Naïve Solution

• The set of neighbors of a node
• Assume each URL is represented by an integer
– for a 118 million page web, we need 27 bits per node
• Naively, this demands 54 bits to represent each hyperlink
– When stored in an adjacency list, we cut the size in half. Why?

The method we will see achieves: 118M nodes, 1G links, with an average of 3 bits per link!!!
6
Observed Properties of Links

• Locality: Most links lead the user to a page on the same host, so if URLs are sorted lexicographically, the indices of the source and the target are close
• Similarity: URLs that are lexicographically close have many common successors
• Consecutivity: Often many successors of a page have consecutive URLs
7
Naïve Representation

• Why do we need out-degree?

8
Gap Compression

Successor list of x: S(x) = {s1, s2, ..., sk} is stored as the gaps
s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1

Only the first gap v = s1 - x can be negative, so for it we actually store:
– 2v if v >= 0
– 2|v| - 1 if v < 0
Why?
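A minimal Python sketch (not the actual WebGraph code) of this gap encoding; the node and successor numbers below are only illustrative:

def to_nonnegative(v):
    # Map a possibly negative first gap to a nonnegative integer.
    return 2 * v if v >= 0 else 2 * abs(v) - 1

def from_nonnegative(u):
    # Inverse mapping.
    return u // 2 if u % 2 == 0 else -((u + 1) // 2)

def gap_encode(x, successors):
    # successors must be sorted: s1 < s2 < ... < sk
    gaps, prev = [], None
    for i, s in enumerate(successors):
        if i == 0:
            gaps.append(to_nonnegative(s - x))   # only the first gap may be negative
        else:
            gaps.append(s - prev - 1)
        prev = s
    return gaps

def gap_decode(x, gaps):
    successors, prev = [], None
    for i, g in enumerate(gaps):
        prev = x + from_nonnegative(g) if i == 0 else prev + g + 1
        successors.append(prev)
    return successors

# Node 15 with successors 13, 15, 16, 17, 18, 19, 23, 24, 203:
print(gap_encode(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# [3, 1, 0, 0, 0, 0, 3, 0, 178]
print(gap_decode(15, [3, 1, 0, 0, 0, 0, 3, 0, 178]))
# [13, 15, 16, 17, 18, 19, 23, 24, 203]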
Reference Compression

• Instead of representing S(x) directly, we can code it as a modified version of S(y), for some y < x
• x - y is the reference number
• The copy list is a bit sequence, of the same length as S(y), indicating which of the values in S(y) are also in S(x)
• We also store a list of extra nodes, S(x) - S(y)
• Usually we only consider references within a sliding window

10
Reference Compression

A reference number of 0 indicates that no referencing is used, i.e., there is no copy list
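A small Python sketch of building the copy list and the extra-node list of S(x) from a reference list S(y); the function names and example lists are illustrative only:

def reference_encode(Sx, Sy):
    sx = set(Sx)
    copy_list = [1 if v in sx else 0 for v in Sy]   # one bit per entry of S(y)
    sy = set(Sy)
    extras = [v for v in Sx if v not in sy]          # S(x) - S(y), still sorted
    return copy_list, extras

def reference_decode(Sy, copy_list, extras):
    copied = [v for v, bit in zip(Sy, copy_list) if bit == 1]
    return sorted(copied + extras)

Sy = [13, 15, 16, 17, 18, 19, 23, 24, 203]
Sx = [15, 16, 17, 22, 23, 24, 315]
copy_list, extras = reference_encode(Sx, Sy)
print(copy_list)   # [0, 1, 1, 1, 0, 0, 1, 1, 0]
print(extras)      # [22, 315]
print(reference_decode(Sy, copy_list, extras) == Sx)   # True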


Differential Compression
• Do not store a copy list that is the length of S(y)
• Instead, look at the copy list as an alternating sequence of blocks of 1s and 0s
• Store the number of blocks
• Store the length of the first block
• Store the lengths - 1 of all other blocks besides the last block (why?)
• Always assume that the first block is a block of 1s (it may be a block of length 0)
12
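A sketch of turning a copy list into copy blocks following the rules above; by the stated convention the first block counts 1s, so it may be empty:

def to_copy_blocks(copy_list):
    blocks, current_bit, run = [], 1, 0
    for bit in copy_list:
        if bit == current_bit:
            run += 1
        else:
            blocks.append(run)
            current_bit, run = bit, 1
    blocks.append(run)
    # Store: block count, first block length, lengths - 1 of the middle blocks.
    # The last block is implicit: its length follows from the length of S(y).
    return [len(blocks), blocks[0]] + [b - 1 for b in blocks[1:-1]]

def from_copy_blocks(encoded, length_of_Sy):
    count, first, middle = encoded[0], encoded[1], encoded[2:]
    blocks = [first] + [b + 1 for b in middle]
    if len(blocks) < count:
        blocks.append(length_of_Sy - sum(blocks))   # recover the implicit last block
    bits, bit = [], 1
    for b in blocks:
        bits.extend([bit] * b)
        bit = 1 - bit
    return bits

copy_list = [0, 1, 1, 1, 0, 0, 1, 1, 0]
enc = to_copy_blocks(copy_list)
print(enc)                                     # [6, 0, 0, 2, 1, 1]
print(from_copy_blocks(enc, 9) == copy_list)   # True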
Using copy blocks
Compressing Extra Nodes List
• Use consecutivity to compress the list of extra nodes
– We find subsequences of consecutive numbers of length >= a given threshold Lmin
• Store a list of integer intervals: for each interval we store the left extreme and the length
– Left extremes are compressed using differences between left extremes - 2
– Interval lengths are decremented by Lmin
• Store a list of residuals (the remaining extra nodes), compressed using gaps and a variable-length encoding
14
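A sketch of the interval-extraction step above: runs of consecutive extra nodes of length >= Lmin become (left extreme, length) intervals, the rest become residuals; the further difference/decrement coding is omitted here:

L_MIN = 2   # illustrative threshold

def intervalize(extra_nodes):
    intervals, residuals = [], []
    i = 0
    while i < len(extra_nodes):
        j = i
        while j + 1 < len(extra_nodes) and extra_nodes[j + 1] == extra_nodes[j] + 1:
            j += 1
        if j - i + 1 >= L_MIN:
            intervals.append((extra_nodes[i], j - i + 1))   # (left extreme, length)
        else:
            residuals.extend(extra_nodes[i:j + 1])
        i = j + 1
    return intervals, residuals

print(intervalize([0, 5, 6, 7, 50, 51, 60]))
# ([(5, 3), (50, 2)], [0, 60])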
Compressing intervals

Important property: these sequences of numbers are self-decodable

Note: the node number itself is not actually stored
Indexing for Wildcard Queries

16
Finding Lexicon Entries

• Given a term t, we can find its entry in the lexicon using binary search
• What happens if we are given a term with a wildcard? How can we find the following terms in the lexicon?
– lab*
– *or
– lab*r

17
Indexing Using n-grams
• Decompose all terms into n-grams, for some small value of n
– n-grams are sequences of n letters in a word
– use $ to mark the beginning and end of a word
– digram = 2-gram
• Example: The digrams of labor are $l, la, ab, bo, or, r$
• We store an additional structure that, for each digram, has a list of terms containing that digram
– actually, we store a list of pointers to entries in the lexicon

18
Example

Term lexicon (alphabetically sorted):

Term Number   Term
1             abhor
2             bear
3             laaber
4             labor
5             laborator
6             labour
7             lavacaber
8             slab

Digram index:

Digram   Term numbers
$a       1
$b       2
$l       3,4,5,6,7
$s       8
aa       3
ab       1,3,4,5,6,7,8
bo       4,5,6
la       3,4,5,6,7,8
or       1,4,5
ou       6
ra       5
ry       5
r$       (can you fill this in?)
sl       (can you fill this in?)
19
Example (cont.)

(Same term and digram tables as on the previous slide.)

To find lab*r, you would look for terms that appear in the lists of $l, la, ab and r$, and then post-process to ensure that each candidate term really does match the pattern.
20
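A Python sketch of the digram index above and of answering lab*r with it; the regex check stands in for the post-processing step, and the function names are illustrative only:

import re

lexicon = ["abhor", "bear", "laaber", "labor", "laborator",
           "labour", "lavacaber", "slab"]            # alphabetically sorted, term numbers 1..8

def digrams(word):
    padded = "$" + word + "$"                        # $ marks the beginning and end of the word
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

index = {}                                           # digram -> list of term numbers
for term_no, term in enumerate(lexicon, start=1):
    for d in digrams(term):
        index.setdefault(d, []).append(term_no)

def wildcard_lookup(pattern):                        # e.g. "lab*r"
    # Keep only the digrams of the pattern that contain no wildcard.
    query_digrams = {d for d in digrams(pattern.replace("*", " ")) if " " not in d}
    candidates = set(range(1, len(lexicon) + 1))
    for d in query_digrams:
        candidates &= set(index.get(d, []))          # intersect the digram posting lists
    # Post-process: e.g. lavacaber contains $l, la, ab and r$ but does not match lab*r,
    # so check each surviving candidate against the actual pattern.
    regex = re.compile("^" + pattern.replace("*", ".*") + "$")
    return sorted(t for t in candidates if regex.match(lexicon[t - 1]))

print(wildcard_lookup("lab*r"))   # [4, 5, 6] -> labor, laborator, labour
print(wildcard_lookup("*or"))     # [1, 4, 5] -> abhor, labor, laborator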
Indexing Using Rotated Lexicons

• If wildcard queries are common, we can save time at the cost of more space
• A rotated index can find the matches of any query with a single wildcard in one binary search
• An index entry is stored for each letter of each term
– labor would have 6 pointers: one for each letter + one for the beginning of the word

21
Partial Example

Term Number   Term
1             abhor
2             bear
4             labor
6             labour

Rotated Form   Address
$abhor         (1,0)
$bear          (2,0)
$labor         (4,0)
$labour        (6,0)
abhor$         (1,1)
abor$l         (4,2)
abour$l        (6,2)
r$abho         (1,5)
r$bea          (2,4)
r$labo         (4,5)
r$labou        (6,6)

Note: We do not actually store the rotated string in the rotated lexicon. The pair (term number, rotation offset) is enough for binary search.
22
Partial Example (cont.)

(Same rotated lexicon as on the previous slide.)

How would you find the terms for:
– lab*
– *or
– *ab*
– l*r
– l*b*r
23
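A Python sketch of the rotated-lexicon idea with the partial lexicon above; only (term number, offset) pairs are kept, and the rotated strings are recomputed on demand for the binary search. Names are illustrative:

import bisect

terms = {1: "abhor", 2: "bear", 4: "labor", 6: "labour"}   # the slide's partial lexicon

def rotation(term, offset):
    padded = "$" + term
    return padded[offset:] + padded[:offset]

entries = [(num, off) for num, term in terms.items() for off in range(len(term) + 1)]
entries.sort(key=lambda e: rotation(terms[e[0]], e[1]))
keys = [rotation(terms[num], off) for num, off in entries]  # materialized here only for bisect

def single_wildcard_lookup(pattern):
    # Rotate the query so that the single '*' ends up last: lab*r becomes r$lab*,
    # and the part before '*' is a prefix we can binary-search for.
    assert pattern.count("*") == 1
    padded = "$" + pattern
    star = padded.index("*")
    prefix = padded[star + 1:] + padded[:star]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")        # end of the prefix range
    return sorted({entries[k][0] for k in range(lo, hi)})

print(single_wildcard_lookup("lab*"))    # [4, 6]  (prefix $lab)
print(single_wildcard_lookup("*or"))     # [1, 4]  (prefix or$)
print(single_wildcard_lookup("lab*r"))   # [4, 6]  (prefix r$lab)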
Summary

• We now know how to find terms that match a wildcard query
• Basic steps of query evaluation for a wildcard query:
– look up the wildcarded words in an auxiliary index, to find all possible matching terms
– given these terms, proceed with normal query processing (as if this were not a wildcard query)
• But we have not yet explained how normal query processing should proceed! This is the next topic.

24
Homework
• We would like to support retrieval of documents
containing phrases
– E.g., given “information retrieval”, return all documents
containing “information” and then “retrieval” immediately
afterwards

• Proposed Solution: Biword index.
– Should efficiently support the following: given two wordIds w1, w2, efficiently return all documents containing w1 immediately followed by w2

25
Homework (cont)
1. Suggest a data structure for storing the biword index.
Describe the structure precisely.
2. Draw the structure assuming that the following documents are given:
– great success on the exam
– Ido passed with great success
– easy on an easy exam

3. Analyze the search complexity of the structure.


4. Analyze the size complexity of the structure.
26
Preprocessing the Data

27
Choosing What Data To Store
• We would like the user to get as many relevant answers to their query as possible
• Examples:
– Query: computer science. Should it match Computer
Science?
– Query: data compression. Should it match compressing
data?
– Query: Amir Perez. Should it match Amir Peretz?
• The way we store the data in our lexicon will affect
our answers to the queries
28
Case Folding

• It is normally accepted to perform case folding, i.e., to reduce all words to lower-case form before storing them in the lexicon
• The user's query is transformed to lower case before looking up the terms in the index
• What effect does this have on the lexicon size?
29
Stemming
• Suppose that a user is interested in finding pages
about “running shoes”
– We may also want to return pages with shoe
– We may also want to return pages with run or runs
• Solution: Use a stemmer
– A stemmer returns the stem (root) of the word
• Note: This means that more relevant answers will be returned, as well as more irrelevant answers!
– Example: cleary AND witten => clear AND wit

30
Porter Stemmer
• A multi-step, longest-match stemmer.
– Paper introducing this stemmer can be found online
• Notation
– v        vowel(s)
– c        consonant(s)
– (vc)m    vowel(s) followed by consonant(s), repeated m times
• Any word can be written: [c](vc)m[v]
– brackets denote optional parts
– m is called the measure of the word
• We discuss only the first few rules of the stemmer

31
Porter Stemmer: Step 1a
Follow first applicable rule

Suffix   Replacement   Examples
sses     ss            caresses => caress
ies      i             ponies => poni
                       ties => ti
ss       ss            caress => caress
s        null          cats => cat
32
Porter Stemmer: Step 1b
Follow first applicable rule
Conditions Suffix Replacement Examples
(m > 0) eed ee feed -> feed
agreed -> agree
(*v*) ed null plastered -> plaster
bled -> bled
(*v*) ing null motoring -> motor
sing -> sing

*v* - the stem contains a vowel

33
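A minimal Python sketch of steps 1a and 1b as given in the tables above; this is not the full Porter stemmer (the later steps, the clean-up rules that follow 1b, and the special treatment of 'y' are omitted):

import re

VOWELS = "aeiou"

def measure(stem):
    # m in [c](vc)m[v]: number of vowel-consonant alternations (ignores 'y' handling).
    pattern = "".join("v" if ch in VOWELS else "c" for ch in stem)
    return len(re.findall("v+c+", pattern))

def contains_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def step_1a(word):
    if word.endswith("sses"):
        return word[:-4] + "ss"        # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"         # ponies -> poni
    if word.endswith("ss"):
        return word                    # caress -> caress
    if word.endswith("s"):
        return word[:-1]               # cats -> cat
    return word

def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word   # agreed -> agree, feed -> feed
    if word.endswith("ed") and contains_vowel(word[:-2]):
        return word[:-2]               # plastered -> plaster (but bled -> bled)
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]               # motoring -> motor (but sing -> sing)
    return word

for w in ["caresses", "ponies", "cats", "feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b(step_1a(w)))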
Stop Words
• Stop words are very common words that generally
are not of importance, e.g.: the, a, to
• Such words take up a lot of room in the index
(why?)
• They slow down query processing (why?)
• They generally do not improve the results (why?)
• Some search engines do not store these words at
all, and remove them from queries
– Is this always a good idea?

34
Spelling correction
Spell correction
• Two principal uses:
1. Correcting document(s) being indexed
2. Retrieve matching documents when query contains
a spelling error
• Types of spelling correction:
– Isolated word spelling correction
– Context-sensitive spelling correction
– Only the latter will catch: I flew form Heathrow to Narita.
Document correction

• Primarily for OCR'ed documents
– Correction algorithms are tuned for this
• Goal: the index (dictionary) contains fewer OCR-induced misspellings
• Can use domain-specific knowledge
– E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
Query mis-spellings

• Our principal focus here


– E.g., the query carot

• We can either
– Retrieve documents indexed by the correct
spelling, OR
– Return several suggested alternative queries with
the correct spelling
• Did you mean … ?
Isolated word correction

• Fundamental premise – there is a lexicon from which the correct spellings come
• Two basic choices for this
– A standard lexicon (e.g., Webster's English Dictionary, an industry-specific lexicon)
– The lexicon of the indexed corpus (e.g., all words on the web, including the mis-spellings)
Isolated word correction

• Problem: Given a lexicon and a character


sequence Q, return the words in the lexicon
closest to Q
• What’s “closest”?
• We’ll study several alternatives
– Edit distance
– Weighted edit distance
– n-gram overlap
Edit distance
• Edit Distance: Given two strings S1 and S2, the
minimum number of basic operations to convert one
to the other
• Basic operations are typically character-level
– Insert, Delete, Replace

• The edit distance from cat to dog is 3.


– What is the edit distance from cat to dot?
– What is the maximal edit distance between s and t?

• Generally found by dynamic programming.


Computing Edit Distance:
Intuition
• Suppose we want to compute edit distance of s and t
• Create a matrix d with 0,…,|s| columns and 0,…,|t|
rows
• The entry d[i,j] is the edit distance between the prefixes:
– sj (i.e., the prefix of s of size j)
– ti (i.e., the prefix of t of size i)

• Where will the edit distance of s and t be placed?


Computing Edit Distance (1)
To compute the edit distance of s and t:
1. n := length(s)
m := length(t)
If n = 0, return m and exit.
If m = 0, return n and exit.
2. Construct a matrix containing 0..m rows and 0..n
columns.
Initialize the first row to 0..n.
Initialize the first column to 0..m.
Computing Edit Distance (2)
3. for each character of s (i from 1 to n).
4. for each character of t (j from 1 to m).
5. If s[i] equals t[j], c := 0.
If s[i] doesn't equal t[j], c := 1.
6. d[i,j] := min(d[i-1,j]+1,
d[i,j-1]+1,
d[i-1,j-1]+c).

7. Return d[n,m].
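The pseudo-code above written out as a short Python function (unit costs; d[i][j] holds the distance between the prefixes s[:i] and t[:j], matching the final return of d[n,m]):

def edit_distance(s, t):
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # first column 0..n
    for j in range(m + 1):
        d[0][j] = j                      # first row 0..m
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + c)    # replace (or match)
    return d[n][m]

print(edit_distance("GUMBO", "GAMBOL"))   # 2
print(edit_distance("cat", "dog"))        # 3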
Example: s=GUMBO, t=GAMBOL

• Steps 1 and 2 (initialization):

        G   U   M   B   O
    0   1   2   3   4   5
G   1
A   2
M   3
B   4
O   5
L   6

Continue on blackboard!
Weighted edit distance
• As above, but the weight of an operation depends
on the character(s) involved
– Meant to capture keyboard errors, e.g. m more likely to be
mis-typed as n than as q
– Therefore, replacing m by n is a smaller edit distance than
by q
– (Same ideas usable for OCR, but with different weights)

• Requires a weight matrix as input
• Modify the dynamic programming to handle weights
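A sketch of the weighted variant: the constant replacement cost c becomes a lookup in a weight table. The weights below are made-up illustrative values, not a real confusion matrix:

REPLACE_COST = {("m", "n"): 0.5, ("n", "m"): 0.5}   # keyboard neighbours are cheap to confuse

def weighted_edit_distance(s, t, insert_cost=1.0, delete_cost=1.0):
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * delete_cost
    for j in range(m + 1):
        d[0][j] = j * insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                rep = 0.0
            else:
                rep = REPLACE_COST.get((s[i - 1], t[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + delete_cost,
                          d[i][j - 1] + insert_cost,
                          d[i - 1][j - 1] + rep)
    return d[n][m]

print(weighted_edit_distance("maybe", "naybe"))   # 0.5: m mis-typed as its neighbour n
print(weighted_edit_distance("maybe", "qaybe"))   # 1.0: m -> q counts as a full replacement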
Edit distance to all dictionary
terms?
• Given a (mis-spelled) query – do we
compute its edit distance to every dictionary
term?
– Expensive and slow

• How do we cut the set of candidate


dictionary terms?
• Here we use n-gram overlap for this
n-gram overlap

• Enumerate all the n-grams in the query


string as well as in the lexicon
• Use the n-gram index (recall wild-card
search) to retrieve all lexicon terms matching
any of the query n-grams
• Threshold by number of matching n-grams
Example with trigrams

• Suppose the text is november


– Trigrams are nov, ove, vem, emb, mbe, ber.

• The query is december


– Trigrams are dec, ece, cem, emb, mbe, ber.

• So 3 trigrams overlap (of 6 in each term)


• How can we turn this into a normalized
measure of overlap?
One option – Jaccard
coefficient
• A commonly-used measure of overlap
• Let X and Y be two sets; then the J.C. is

|X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and
zero when they are disjoint
• X and Y don’t have to be of the same size
• Always assigns a number between 0 and 1
– Now threshold to decide if you have a match
– E.g., if J.C. > 0.8, declare a match
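A small Python sketch computing the trigram overlap and Jaccard coefficient for the november/december example above:

def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

t1, t2 = trigrams("november"), trigrams("december")
print(sorted(t1))       # ['ber', 'emb', 'mbe', 'nov', 'ove', 'vem']
print(sorted(t2))       # ['ber', 'cem', 'dec', 'ece', 'emb', 'mbe']
print(len(t1 & t2))     # 3 trigrams overlap
print(jaccard(t1, t2))  # 3 / 9 = 0.33..., well below a 0.8 match threshold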
Matching trigrams

(Can we compute the Jaccard coefficient if the words themselves are not stored in the n-gram index?)

• Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo: alone, lord, sloth
or: border, lord, morbid
rd: ardent, border, card

• A standard postings “merge” will enumerate the candidates …
• Adapt this to using Jaccard (or another) measure.
Computational cost

• Spell-correction is computationally
expensive
• Avoid running routinely on every query?
– Run only on queries that matched few docs
