Chapter 2 Text Operations

Uploaded by

Dawit Sebhat

Chapter Two

Text Operations

Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
◦ These factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
• A few words are very common.
◦ The two most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
Statistical Properties of Text...
• Most words are very rare.
◦ Half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”).
Sample Word Frequency Data

Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
◦ attempts to capture the distribution of the frequencies (number of occurrences) of the words within a text.
• Zipf's Law states that when the distinct words in a text are ranked by frequency from most frequent to least frequent, the product of rank and frequency is approximately constant.
Zipf's Law...
Frequency * Rank = constant

That is, if the words w in a collection are ranked by their frequency f, with r the rank of each word, they roughly fit the relation:
r * f = c
◦ Different collections have different constants c.
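The rank-frequency relation can be checked on any tokenized text. The following is a minimal Python sketch, assuming whitespace tokenization and a toy sentence (both are illustrative and not taken from the slides); the function name `rank_frequency` is hypothetical.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    # most_common() sorts by decreasing frequency; rank 1 is the most frequent word.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Toy text (an assumption for illustration); a real check needs a large corpus.
text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text.split())
for rank, word, freq, product in rows:
    print(f"r={rank} w={word} f={freq} r*f={product}")
```

On a text this small the product r * f fluctuates; the roughly constant behaviour described by Zipf's Law only shows up on large collections.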
Zipf's distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w:
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf's law, with frequency f plotted against rank r; word w has rank r and frequency f]
Example: Zipf's Law
• The table shows the most frequently occurring words from a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.
Methods that Build on Zipf's Law
• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take the words in between the most frequent (upper cut-off) and least frequent (lower cut-off) words.
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
Zipf's Law Impact on IR
◦ Good news: stopwords account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
◦ Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing.
• For this, Luhn specifies two cut-off points, an upper and a lower cut-off, on the basis of which non-significant words are excluded.
Word significance: Luhn’s Ideas
• The words exceeding the upper cut-off were considered to be common
• The words below the lower cut-off were considered to be rare
• Hence neither group contributes significantly to the content of the text
• The ability of words to discriminate content reaches a peak at a rank order position halfway between the two cut-offs
• Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency.
[Figure: plot relating f and r]
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing.
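Luhn's two cut-offs can be sketched as a simple frequency filter. The Python below is an illustration only: the function name `significant_words`, the cut-off values, and the toy token list are assumptions, since the slides do not say how the cut-offs are chosen.

```python
from collections import Counter


def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(tokens)
    return {word for word, f in counts.items() if lower_cutoff <= f <= upper_cutoff}


# Hypothetical data: "the" is too common, "zipf" is too rare for these cut-offs.
tokens = "the the the the index index retrieval retrieval zipf".split()
kept = significant_words(tokens, lower_cutoff=2, upper_cutoff=3)
print(sorted(kept))
```

In practice the cut-offs would be tuned per collection; the values 2 and 3 here only make the toy example work.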
Vocabulary size: Heaps’ Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
◦ This determines how the size of the inverted index will scale with the size of the corpus.
Vocabulary Growth: Heaps’ Law
• Heaps’ law estimates the vocabulary size (number of distinct words) of a given corpus.
◦ The vocabulary size grows as $O(n^{\beta})$, where $\beta$ is a constant between 0 and 1.
◦ If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law gives the following equation:
$V = K n^{\beta}$
• with constants:
◦ $K \approx 10$ to $100$
◦ $\beta \approx 0.4$ to $0.6$ (approximately square root)
Heaps’ distributions
• Distribution of the size of the vocabulary: on a log-log scale, there is a linear relationship between vocabulary size and number of tokens.

Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Do you agree?
Example
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpus sizes:
◦ For 100,000 words, there are 50,000 unique words
◦ For 500,000 words, there are 150,000 unique words
◦ Estimate the vocabulary size for the 1,000,000-word corpus.
◦ How about for a corpus of 1,000,000,000 words?
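One way to approach this exercise is to solve Heaps' equation V = K * n^β for K and β from the two known points and then extrapolate. The Python sketch below assumes the two observations fit the law exactly; the function names `fit_heaps` and `predict_vocabulary` are hypothetical.

```python
import math


def fit_heaps(n1, v1, n2, v2):
    """Solve V = K * n**beta for (K, beta) from two (n, V) observations."""
    # Dividing the two equations eliminates K: v2 / v1 = (n2 / n1) ** beta.
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta


def predict_vocabulary(k, beta, n):
    """Estimate the vocabulary size for a corpus of n words."""
    return k * n ** beta


k, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
v_million = predict_vocabulary(k, beta, 1_000_000)
v_billion = predict_vocabulary(k, beta, 1_000_000_000)
print(f"beta = {beta:.3f}, K = {k:.1f}")
print(f"V(10^6) = {v_million:,.0f}")
print(f"V(10^9) = {v_billion:,.0f}")
```

With these two data points β comes out near 0.68, slightly above the typical range, and the estimate for the 1,000,000-word corpus is about 240,000 unique words. Extrapolating three orders of magnitude to a billion words from two points is much less reliable.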
Text Operations
• Not all words in a document are equally significant for representing the contents/meaning of a document
◦ Some words carry more meaning than others
◦ Nouns are the most representative of a document's content
• Therefore, the text of the documents in a collection needs to be preprocessed before its words are used as index terms
Text Operations...
• Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
◦ Preprocessing is expected to improve information retrieval performance
• However, some search engines on the Web omit preprocessing
◦ Every word in the document is an index term
• Text operations are the processes that transform text into logical representations
• The main operations for selecting index terms are:
◦ Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
◦ Elimination of stop words: filter out words which are not useful in the retrieval process
◦ Stemming words: remove affixes (prefixes and suffixes)
◦ Construction of term categorization structures such as a thesaurus/wordlist, to capture relationships that allow the expansion of the original query with related terms
Generating Document Representatives
• Text Processing System
◦ Input text: full text, abstract or title
◦ Output: a document representative adequate for use in an automatic retrieval system
[Figure: documents → tokenization → stop words → stemming → thesaurus → index terms]
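The stages of this text processing system can be sketched end to end in Python. Everything named here is an assumption for illustration: the stop list, the suffix list, and the `naive_stem` helper are toy stand-ins (not a real stemming algorithm), and the thesaurus stage is left out.

```python
import re

# Illustrative stop list and suffix list; real systems use much larger resources.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and keep unbroken runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """documents -> tokenization -> stop words -> stemming -> index terms."""
    return sorted({naive_stem(t) for t in remove_stop_words(tokenize(text))})


terms = index_terms("The indexing of documents and the retrieval of indexed documents")
print(terms)  # ['document', 'index', 'retrieval']
```

The ten input words collapse to three index terms, which shows how preprocessing controls the vocabulary size.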
Lexical Analysis/Tokenization of Text
• Change the text of the documents into words to be adopted as index terms
• Objective: identify words in the text
◦ Issues: digits, hyphens, punctuation marks, case of letters
◦ Numbers are usually not good index terms (like 1910, 1999), but some, such as 510 B.C., are unique
Lexical Analysis...
• Hyphens: break up the words (e.g. state-of-the-art = state of the art), but some words, e.g. gilt-edged, B-49, are unique words which require hyphens
• Punctuation marks: remove totally unless significant, e.g. program code: [Link] and xexe
• Case of letters: usually not important, so all letters can be converted to upper or lower case
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing)
◦ Friends, Romans, and, Countrymen
• Each such token is now a candidate for an index entry, after further processing
• But what are valid tokens to emit?

Issues in Tokenization
• One word or multiple: how do you decide whether it is one token or two or more?
◦ Hewlett-Packard → Hewlett and Packard as two tokens?
▪ state-of-the-art: break up hyphenated sequence.
▪ San Francisco, Los Angeles
▪ Addis Ababa, Bahir Dar
◦ lowercase, lower-case, lower case?
▪ data base, database, data-base
• Numbers:
▪ dates (3/12/91 vs. Mar. 12, 1991);
▪ phone numbers,
▪ IP addresses ([Link])
Issues in Tokenization
• How should special cases involving apostrophes, hyphens, etc. be handled? C++, C#, URLs, emails, …
◦ Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
◦ However, frequently they are not.
Issues in Tokenization
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
◦ Generally, numbers are not indexed as text, although they are often very useful. Systems will often index “meta-data”, including creation date, format, etc., separately.
• Issues of tokenization are language specific
◦ Requires the language to be known
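The simplest approach described above can be written as a one-line regular expression. This Python sketch assumes English text and ASCII letters only; the function name `simple_tokenize` is hypothetical, and the second call shows where the approach breaks down.

```python
import re


def simple_tokenize(text):
    """Ignore numbers and punctuation; return lowercased runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


tokens = simple_tokenize("Friends, Romans and Countrymen")
print(tokens)  # ['friends', 'romans', 'and', 'countrymen']

# Apostrophes split words into fragments such as 'o', 'neill', 'aren', 't'.
tricky = simple_tokenize("Mr. O'Neill thinks that the boys' stories aren't amusing.")
print(tricky)
```

The fragments produced for "O'Neill" and "aren't" illustrate why apostrophes and similar special cases need language-specific rules.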
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s a very old cat.
• Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
$f_{ij}$ = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
$tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
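The normalization can be sketched in a few lines of Python. The function name `normalized_tf` and the toy token list are assumptions for illustration; an empty document returns an empty mapping as a design choice of this sketch.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return tf_ij = f_ij / max_i f_ij for every term in one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}


# Toy document in which term "a" occurs 3 times, "b" twice and "c" once.
tf = normalized_tf(["a", "a", "a", "b", "b", "c"])
print(tf)
```

The most common term always gets tf = 1, so documents of different lengths become comparable.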
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
$df_i$ = document frequency of term i
= number of documents containing term i
$idf_i$ = inverse document frequency of term i
= $\log_2(N / df_i)$
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term importance indicator is tf-idf weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2(N / df_i)$
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume the collection contains 10,000 documents and the document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
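The worked example above can be reproduced with a short Python sketch. The numbers come from the example; the function name `tf_idf` and the dictionary layout are assumptions made for illustration.

```python
import math

N = 10_000                                       # documents in the collection
term_freq = {"A": 3, "B": 2, "C": 1}             # f_ij for the example document
doc_freq = {"A": 50, "B": 1300, "C": 250}        # df_i for each term


def tf_idf(term_freq, doc_freq, n_docs):
    """Compute w_ij = (f_ij / max f_ij) * log2(N / df_i) for each term."""
    max_f = max(term_freq.values())
    weights = {}
    for term, f in term_freq.items():
        tf = f / max_f
        idf = math.log2(n_docs / doc_freq[term])
        weights[term] = tf * idf
    return weights


weights = tf_idf(term_freq, doc_freq, N)
for term, w in weights.items():
    print(f"{term}: tf-idf = {w:.1f}")
```

Rounded to one decimal place, the weights agree with the values in the example: 7.6 for A, 2.0 for B and 1.8 for C.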
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
◦ It is possible to rank the retrieved documents in the order of presumed relevance.
◦ It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Similarity Measure - Inner Product
• Similarity between the vectors for document $d_j$ and query q can be computed as the vector inner product (a.k.a. dot product):
$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} w_{iq}$
where $w_{ij}$ is the weight of term i in document j, $w_{iq}$ is the weight of term i in the query, and t is the number of index terms
• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
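Both cases can be illustrated with a small Python sketch that stores each vector as a term-to-weight dictionary, so that only matched terms contribute. The function name `inner_product` and the example vectors are assumptions for illustration.

```python
def inner_product(doc_weights, query_weights):
    """Sum w_ij * w_iq over the terms present in both document and query."""
    return sum(
        w * query_weights[term]
        for term, w in doc_weights.items()
        if term in query_weights
    )


# Binary vectors: the result is the number of matched query terms.
binary_doc = {"a": 1, "b": 1, "c": 1}
binary_query = {"a": 1, "c": 1, "d": 1}
matched = inner_product(binary_doc, binary_query)
print(matched)  # 2

# Weighted vectors: only "zipf" matches, contributing 5.0 * 2.0.
weighted_doc = {"retrieval": 2.0, "index": 3.0, "zipf": 5.0}
weighted_query = {"zipf": 2.0, "heaps": 1.0}
score = inner_product(weighted_doc, weighted_query)
print(score)  # 10.0
```

Terms missing from either vector have an implicit weight of 0, which is why unmatched terms such as "d" or "heaps" do not affect the score.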
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not matched.