Chapter Two
Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
◦ These factors affect the performance of an IR system and can be used to select
suitable term weights and other aspects of the system.
A few words are very common.
◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word
occurrences.
Most words are very rare.
◦ About half the words in a corpus appear only once; these are called
hapax legomena (Greek for “read only once”)
Sample Word Frequency Data
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies
(number of occurrences) of the words within a text.
Zipf's Law states that when the distinct words in a text
are ranked by frequency from most frequent to least
frequent, the product of rank and frequency is approximately constant.
Frequency * Rank ≈ constant
That is, if the words, w, in a collection are ranked, r,
by their frequency, f, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.
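The relation r * f = c can be checked on any tokenized text. The following is a minimal Python sketch, assuming whitespace tokenization and a toy text (both are illustrative choices, not part of the slides). On such a tiny sample the product only loosely approximates a constant; the law is meant for large collections.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # most_common() sorts by descending frequency; ties keep first-seen order.
    ranked = counts.most_common()
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Toy text, assumed for illustration only.
text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text.lower().split())
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

Tied frequencies receive consecutive ranks here; other tie-handling conventions are possible.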
Zipf's Distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w:
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1,
etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law, plotting frequency f against rank r; word w has rank r and frequency f]
Example: Zipf's Law
The table shows the most frequently occurring words
from a collection of 336,310 documents containing 125,720,891
total words, of which 508,209 are unique.
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words
(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to terms
based on their frequency, with the most frequent
words weighted less. Used by almost all ranking
methods.
Zipf's Law Impact on IR
◦ Good news: Stop words account for a large fraction
of text, so eliminating them greatly reduces inverted-index
storage costs.
◦ Bad news: Most words are extremely rare, so gathering
sufficient data for meaningful statistical analysis (e.g. correlation
analysis for query expansion) is difficult.
Word significance: Luhn’s Ideas
Luhn's idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.
Luhn suggested that both extremely common and extremely
uncommon words are not very useful for indexing.
He therefore specified two cut-off points, an upper and a
lower, which are used to exclude non-significant words.
Words exceeding the upper cut-off are considered too common.
Words below the lower cut-off are considered too rare.
Neither group contributes significantly to the content of the text.
The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cut-offs.
Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of frequency; then a plot relating f and r
yields a hyperbolic curve, as expected from Zipf's law (r * f = c).
In short, extremely common and extremely uncommon words are both
of little use for document representation and indexing (Luhn, 1958).
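Luhn's two cut-offs can be sketched as a simple frequency filter. The example below is a minimal Python illustration; the function name `significant_words`, the toy text, and the cut-off values 2 and 3 are all assumptions chosen for the demonstration, since no specific values are given here.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(tokens)
    return {w: f for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}

# Toy text, assumed for illustration only.
text = "the index maps the terms of the text to the documents and the index stores terms"
kept = significant_words(text.split(), lower_cutoff=2, upper_cutoff=3)
print(kept)  # {'index': 2, 'terms': 2}
```

Here "the" (5 occurrences) falls above the upper cut-off and the words occurring once fall below the lower cut-off, so only the mid-frequency words remain.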
Vocabulary Size: Heaps’ Law
How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will
scale with the size of the corpus.
Vocabulary Growth: Heaps’ Law
Heaps’ law estimates the number of distinct words (the vocabulary size) in a
given corpus
◦ The vocabulary size grows as O(n^β), where β is a constant
between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps’ law gives the following equation:
V = K · n^β
where the constants are typically:
◦ K ≈ 10−100
◦ β ≈ 0.4−0.6 (approx. square-root)
Heaps’ Distributions
• Distribution of the size of the vocabulary: on a log-log scale, there is a linear
relationship between vocabulary size and number of
tokens, since log V = log K + β log n
Example: from 1,000,000,000 documents, there
may be 1,000,000 distinct words. Do you agree?
Example
We want to estimate the size of the vocabulary
for a corpus of 1,000,000 words. However, we
only know statistics computed on smaller
corpora:
◦ For 100,000 words, there are 50,000 unique words
◦ For 500,000 words, there are 150,000 unique words
◦ Estimate the vocabulary size for the 1,000,000-word
corpus.
◦ How about for a corpus of 1,000,000,000 words?
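One way to approach this example is to fit the two Heaps' law constants to the two known data points and then extrapolate. Taking logarithms of V = K · n^β at both points gives β = log(V2/V1) / log(n2/n1), and K follows by substitution. The Python sketch below assumes the law holds exactly for this corpus; the function names are illustrative.

```python
import math

def fit_heaps(n1, v1, n2, v2):
    """Solve V = K * n**beta for (K, beta) from two (corpus size, vocabulary size) points."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta

def heaps_vocab(n, k, beta):
    """Predicted vocabulary size for a corpus of n words."""
    return k * n ** beta

k, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
v_million = heaps_vocab(1_000_000, k, beta)
v_billion = heaps_vocab(1_000_000_000, k, beta)
print(f"K = {k:.1f}, beta = {beta:.3f}")
print(f"V(10^6) ~ {v_million:,.0f}")
print(f"V(10^9) ~ {v_billion:,.0f}")
```

With these two points β comes out near 0.68, slightly above the typical 0.4 to 0.6 range, and the estimate for 1,000,000 words is roughly 240,000 unique words. Extrapolating three orders of magnitude further to 1,000,000,000 words is much less reliable, because small errors in β are amplified.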
Text Operations
Not all words in a document are equally significant for
representing the contents/meaning of the document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a
document's content
Therefore, one needs to preprocess the text of the
documents in a collection before using it as index terms
Preprocessing is the process of controlling the size of the
vocabulary, i.e. the number of distinct words used as index terms
◦ Preprocessing is expected to improve information
retrieval performance
However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term
Text operations transform text into logical
representations
The main operations for selecting index terms are:
◦ Lexical analysis/tokenization of the text: handle digits, hyphens, punctuation marks, and the
case of letters
◦ Elimination of stop words: filter out words that are not useful in the retrieval
process
◦ Stemming: remove affixes (prefixes and suffixes)
◦ Construction of term categorization structures, such as a thesaurus/wordlist, to capture
relationships that allow the original query to be expanded with related terms
Generating Document Representatives
Text Processing System
◦ Input text – full text, abstract or title
◦ Output – a document representative adequate for use in an
automatic retrieval system
[Figure: documents → tokenization → stop-word removal → stemming → thesaurus → index terms]
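The first three stages of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hand-picked set, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer (such as Porter's); the thesaurus stage is omitted.

```python
import re

# Tiny illustrative stop list; real systems use far longer lists.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix, keeping at least three characters (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Return the sorted set of index terms for one document."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({crude_stem(t) for t in tokens})

terms = index_terms("The indexing of documents and the ranking of the indexed documents")
print(terms)  # ['document', 'index', 'rank']
```

Eleven input words reduce to three index terms, which illustrates how preprocessing controls the vocabulary size.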
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words to be adopted
as index terms
Objective: identify the words in the text
◦ Issues: digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (e.g. 1910, 1999),
but some, such as 510 B.C., are unique enough to keep
Hyphens: break up the words (e.g. state-of-the-art = state of
the art), but some words (e.g. gilt-edged, B-49) are unique words
that require their hyphens
Punctuation marks: remove entirely unless significant,
e.g. in program code: [Link] and xexe
Case of letters: usually not important, so all letters can be converted to
upper or lower case
Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (a token is an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
Each such token is now a candidate for an index entry,
after further processing
But what are valid tokens to emit?
Issues in Tokenization
One word or multiple: how do you decide whether it is one token or
two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
◦ state-of-the-art: break up the hyphenated sequence?
◦ San Francisco, Los Angeles, Addis Ababa, Bahir Dar: one token or two?
◦ lowercase, lower-case, lower case?
◦ data base, database, data-base?
• Numbers:
dates (3/12/91 vs. Mar. 12, 1991);
phone numbers,
IP addresses ([Link])
How should special cases involving apostrophes, hyphens,
etc. be handled? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
◦ However, frequently they are not.
The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
◦ Generally, numbers are not indexed as text, although they are often very useful. Systems
often index metadata, including creation date, format, etc., separately
Issues of tokenization are language-specific
◦ They require the language to be known
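The simplest approach described above can be sketched with one regular expression. This minimal Python example assumes English text restricted to ASCII letters; the function name and the second sample sentence are illustrative. It also shows what the approach loses: apostrophes, hyphenated terms, and numbers are broken up or dropped.

```python
import re

def simple_tokenize(text):
    """Return lowercase runs of ASCII letters; digits and punctuation act as separators."""
    return re.findall(r"[a-z]+", text.lower())

tokens_a = simple_tokenize("Friends, Romans and Countrymen")
tokens_b = simple_tokenize("Mr. O'Neill bought a state-of-the-art B-49 in 1999.")
print(tokens_a)  # ['friends', 'romans', 'and', 'countrymen']
print(tokens_b)  # ['mr', 'o', 'neill', 'bought', 'a', 'state', 'of', 'the', 'art', 'b', 'in']
```

In the second sentence "O'Neill" is split in two, "B-49" is reduced to "b", and "1999" disappears, which is why practical tokenizers add language-specific rules for such cases.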
Exercise: Tokenization
The cat slept peacefully in the living room. It’s a
very old cat.
Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
Term Weights: Term Frequency
More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}
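The normalization can be sketched as follows in Python, assuming a document is given as a non-empty list of tokens (the function name `normalized_tf` is illustrative).

```python
from collections import Counter

def normalized_tf(tokens):
    """Map each term to f_ij / max_i f_ij for a single, non-empty document."""
    counts = Counter(tokens)
    max_f = max(counts.values())  # frequency of the most common term
    return {term: f / max_f for term, f in counts.items()}

tf = normalized_tf(["A", "A", "A", "B", "B", "C"])
print(tf)
```

The most common term always gets tf = 1, and every other term gets a value between 0 and 1, which makes term frequencies comparable across documents of different lengths.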
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are
less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log2(N / dfi)
(N: total number of documents)
An indication of a term’s discrimination power.
The logarithm is used to dampen the effect relative to tf.
TF-IDF Weighting
A typical combined term importance indicator
is tf-idf weighting:
wij = tfij × idfi = tfij × log2(N / dfi)
A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
Many other ways of determining term weights
have been proposed.
Experimentally, tf-idf has been found to work
well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
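The figures in this example can be reproduced with a short Python sketch that uses max-normalized tf and a base-2 logarithm for idf. The function name `tf_idf` and the dictionary layout are illustrative choices.

```python
import math

def tf_idf(term_freqs, doc_freqs, n_docs):
    """Return {term: (tf, idf, tf * idf)} using max-normalized tf and log2 idf."""
    max_f = max(term_freqs.values())
    weights = {}
    for term, f in term_freqs.items():
        tf = f / max_f
        idf = math.log2(n_docs / doc_freqs[term])
        weights[term] = (tf, idf, tf * idf)
    return weights

weights = tf_idf({"A": 3, "B": 2, "C": 1}, {"A": 50, "B": 1300, "C": 250}, 10_000)
for term, (tf, idf, w) in weights.items():
    print(f"{term}: tf = {tf:.2f}; idf = {idf:.1f}; tf-idf = {w:.1f}")
```

Although B occurs twice as often as C in the document, their weights are close, because B appears in far more documents of the collection and therefore has a much lower idf.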
Similarity Measure
A similarity measure is a function that computes
the degree of similarity between two vectors.
Using a similarity measure between the query
and each document:
◦ It is possible to rank the retrieved documents in the
order of presumed relevance.
◦ It is possible to enforce a certain threshold so that
the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
Similarity between the vectors for document dj and query q can be
computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj • q = Σ_{i=1}^{t} wij · wiq
where wij is the weight of term i in document j, wiq is the weight of term i in
the query, and t is the number of index terms
For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
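Both cases can be illustrated with a small Python sketch that stores each vector sparsely as a dictionary from term to weight, so that terms with zero weight are simply absent. The terms and weights below are made up for illustration.

```python
def inner_product(doc_weights, query_weights):
    """Sum w_ij * w_iq over the terms present in both sparse vectors."""
    return sum(w * query_weights[t] for t, w in doc_weights.items() if t in query_weights)

# Binary vectors: the score is the number of matched query terms.
doc_binary = {"retrieval": 1, "database": 1, "architecture": 1}
query_binary = {"retrieval": 1, "architecture": 1, "text": 1}
print(inner_product(doc_binary, query_binary))  # 2

# Weighted vectors (weights are made up for illustration).
doc_weighted = {"retrieval": 2.0, "database": 3.0, "text": 5.0}
query_weighted = {"retrieval": 1.0, "text": 2.0}
print(inner_product(doc_weighted, query_weighted))  # 12.0
```

Terms that appear in only one of the two vectors contribute nothing to the score.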
Properties of Inner Product
The inner product is unbounded.
Favors long documents with a large number
of unique terms.
Measures how many terms matched but not
how many terms are not matched.