Modern Information Retrieval
Chapter 7: Text Operations
Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Document Preprocessing
Lexical analysis of the text
Elimination of stopwords
Stemming
Selection of index terms
Construction of term categorization structures
Lexical Analysis of the Text
Word separators
space
digits
hyphens
punctuation marks
the case of the letters
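A minimal tokenizer sketch illustrating these separator choices (the exact separator set and the all-lowercase policy are assumptions for illustration, not the book's algorithm):

```python
import re

def tokenize(text):
    # Normalize the case of the letters.
    text = text.lower()
    # Treat whitespace, digits, hyphens, and punctuation marks as separators
    # (a deliberately simple policy; real systems treat hyphens and digits
    # with more care, e.g. keeping "B2B" or "state-of-the-art" intact).
    tokens = re.split(r"[\s\d\-.,;:!?'\"()\[\]]+", text)
    return [t for t in tokens if t]

print(tokenize("State-of-the-art B2B systems, circa 1999!"))
# -> ['state', 'of', 'the', 'art', 'b', 'b', 'systems', 'circa']
```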
Elimination of Stopwords
A list of stopwords
words that are too frequent among the documents
articles, prepositions, conjunctions, etc.
Can reduce the size of the indexing structure
considerably
Problem
Search for “to be or not to be”?
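A small sketch of stopword elimination with a toy stoplist, showing how this query disappears entirely:

```python
# Toy stoplist of articles, prepositions, conjunctions, etc. (illustrative only).
STOPWORDS = {"to", "be", "or", "not", "the", "a", "an", "of", "and"}

def remove_stopwords(tokens):
    # Drop words that are too frequent among the documents to discriminate.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("to be or not to be".split()))
# -> []  every word is a stopword, so nothing is left to index or search
```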
Stemming
Example
connect, connected, connecting, connection, connections
effectiveness --> effective --> effect
picnicking --> picnic
king -/-> k (no over-stemming: “king” must not be reduced to “k”)
Removing strategies
affix removal: intuitive, simple (see the sketch after this list)
table lookup
successor variety
n-gram
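A minimal affix-removal sketch (illustrative suffix rules plus a minimum-stem-length guard; not the Porter algorithm, and real stemmers add recoding rules, e.g. to finish "picnicking" -> "picnick" -> "picnic"):

```python
MIN_STEM = 3  # guard against over-stemming: "king" must not become "k"
SUFFIXES = ["ations", "ions", "ion", "ings", "ing", "ed", "s"]  # longest first

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections", "king"]:
    print(w, "->", stem(w))
# The connect family all reduce to "connect"; "king" is left untouched.
```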
Index Terms Selection
Motivation
A sentence is usually composed of nouns, pronouns,
articles, verbs, adjectives, adverbs, and connectives.
Most of the semantics is carried by the nouns.
Identification of noun groups
A noun group is a set of nouns whose syntactic
distance in the text does not exceed a predefined
threshold
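A toy sketch of noun-group identification under this definition; the hand-labeled noun set and the distance threshold are illustrative assumptions (a real system would use a part-of-speech tagger):

```python
NOUNS = {"information", "retrieval", "systems", "queries", "users"}  # toy lexicon
THRESHOLD = 1  # max non-noun words tolerated between nouns of one group

def noun_groups(tokens):
    groups, current, gap = [], [], 0
    for tok in tokens:
        if tok in NOUNS:
            current.append(tok)
            gap = 0
        elif current:
            gap += 1
            if gap > THRESHOLD:  # syntactic distance exceeded: close the group
                groups.append(current)
                current, gap = [], 0
    if current:
        groups.append(current)
    return groups

text = "information retrieval systems answer the queries of their users"
print(noun_groups(text.split()))
# -> [['information', 'retrieval', 'systems'], ['queries'], ['users']]
```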
Thesauri
Roget's Thesaurus (Peter Mark Roget; 1988 edition)
Example
cowardly adj.
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven,
dastardly, faint-hearted, gutless, lily-livered,
pusillanimous, unmanly, yellow (slang),
yellow-bellied (slang).
A thesaurus provides a controlled vocabulary for
indexing and searching
The Purpose of a Thesaurus
To provide a standard vocabulary for indexing
and searching
To assist users with locating terms for proper
query formulation
To provide classified hierarchies that allow the
broadening and narrowing of the current query
request
Thesaurus Term Relationships
BT: broader term
NT: narrower term
RT: related term (non-hierarchical)
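A small sketch of how these relationships might drive query reformulation; the toy thesaurus entries are assumptions for illustration:

```python
# Toy thesaurus: term -> {relationship -> related terms}.
THESAURUS = {
    "computer": {"BT": ["machine"], "NT": ["laptop", "server"], "RT": ["software"]},
}

def expand(term, relationship):
    # "BT" broadens the query, "NT" narrows it, "RT" suggests related terms.
    return THESAURUS.get(term, {}).get(relationship, [])

print(expand("computer", "BT"))  # broaden -> ['machine']
print(expand("computer", "NT"))  # narrow  -> ['laptop', 'server']
```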
Term Selection
Automatic Text Processing
by G. Salton, Chap 9,
Addison-Wesley, 1989.
Automatic Indexing
Indexing:
assign identifiers (index terms) to text documents.
Identifiers:
single-term vs. term phrase
controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names,
dates of publications, …
Two Issues
Issue 1: indexing exhaustivity
exhaustive: assign a large number of terms
nonexhaustive
Issue 2: term specificity
broad terms (generic)
cannot distinguish relevant from nonrelevant documents
narrow terms (specific)
retrieve relatively fewer documents, but most of them are
relevant
Parameters of Retrieval Effectiveness
Recall
R = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in the collection}}
Precision
P = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}
Goal
high recall and high precision
The retrieved part of the collection vs. the relevant items:

                 Relevant   Nonrelevant
  Retrieved         a            b
  Not retrieved     d            c

Recall = \frac{a}{a+d}        Precision = \frac{a}{a+b}
A Joint Measure
F-score
F = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}
β is a parameter that encodes the relative importance of recall and precision.
β = 1: equal weight
β < 1: precision is more important
β > 1: recall is more important
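A worked sketch computing recall, precision, and the F-score from the contingency counts above (the counts themselves are made up):

```python
def effectiveness(a, b, d, beta=1.0):
    # a: relevant retrieved, b: nonrelevant retrieved, d: relevant not retrieved
    recall = a / (a + d)
    precision = a / (a + b)
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f

print(effectiveness(a=30, b=20, d=10))
# -> (0.75, 0.6, 0.666...) with beta = 1, i.e. equal weight
```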
Choices of Recall and Precision
Both recall and precision vary from 0 to 1.
Particular choices of indexing and search policies
have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision
and 0.8 recall.
In many circumstances, recall and precision values
both between 0.5 and 0.6 are satisfactory for the
average user.
Term-Frequency Consideration
Function words
for example, "and", "or", "of", "but", …
the frequencies of these words are high in all texts
Content words
words that actually relate to document content
varying frequencies in the different texts of a collection
indicate term importance for content
A Frequency-Based Indexing Method
Eliminate common function words from the document
texts by consulting a special dictionary, or stop list,
containing a list of high frequency function words.
Compute the term frequency tfij for all remaining terms Tj
in each document Di, specifying the number of
occurrences of Tj in Di.
Choose a threshold frequency T, and assign to each
document Di all terms Tj for which tfij > T.
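A sketch of these three steps with a made-up stoplist and threshold:

```python
from collections import Counter

STOPLIST = {"the", "of", "and", "a", "to", "in", "on"}  # step 1 (toy stoplist)
T = 1                                                   # step 3 (toy threshold)

def index_terms(document_text):
    tokens = [w for w in document_text.lower().split() if w not in STOPLIST]
    tf = Counter(tokens)                                # step 2: tf_ij
    return {term for term, freq in tf.items() if freq > T}

print(index_terms("the cat sat on the mat and the cat slept"))
# -> {'cat'}  only 'cat' occurs more than T = 1 times
```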
Inverse Document Frequency
Inverse Document Frequency (IDF) for term Tj:
idf_j = \log \frac{N}{df_j}
where dfj (the document frequency of term Tj) is the
number of documents in which Tj occurs, and N is the
number of documents in the collection.
Terms that occur frequently in individual documents
but rarely in the remainder of the collection fulfil
both the recall and the precision goals.
TFxIDF
Weight wij of a term Tj in a document Di:
w_{ij} = tf_{ij} \times \log \frac{N}{df_j}
Eliminating common function words
Computing the value of wij for each term Tj in each
document Di
Assigning to the documents of a collection all terms with
sufficiently high (tf x idf) weights
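A minimal tf-idf sketch over a toy three-document collection, following the formula above:

```python
import math
from collections import Counter

docs = [["information", "retrieval", "systems"],
        ["database", "systems"],
        ["information", "theory"]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequencies df_j

def tfidf(doc):
    tf = Counter(doc)  # term frequencies tf_ij
    return {t: tf[t] * math.log(N / df[t]) for t in tf}  # w_ij = tf_ij * log(N/df_j)

print(tfidf(docs[0]))
# 'retrieval' (df = 1) outweighs 'information' and 'systems' (df = 2)
```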
Term-discrimination Value
Useful index terms
Distinguish the documents of a collection from
each other
Document Space
When two documents are assigned very similar
term sets, the corresponding points in the
document configuration appear close together
When a high-frequency term without
discrimination is assigned, it will increase the
document space density
A Virtual Document Space
Figure: the document space in its original state, after
assignment of a good discriminator (documents spread
apart), and after assignment of a poor discriminator
(documents cluster together).
Good Term Assignment
When a term is assigned to the documents of a
collection, the few objects to which the term is
assigned will be distinguished from the rest of
the collection.
This should increase the average distance
between the objects in the collection and hence
produce a document space less dense than
before.
Poor Term Assignment
A high frequency term is assigned that does not
discriminate between the objects of a collection.
Its assignment will render the documents more
similar to each other.
This is reflected in an increase in document
space density.
Term Discrimination Value
Definition
dvj = Q - Qj
where Q and Qj are the space densities before and
after the assignment of term Tj.
Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} sim(D_i, D_k)
dvj>0, Tj is a good term;
dvj<0, Tj is a poor term.
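A sketch of dvj computed directly from this definition, using cosine similarity over toy weight vectors; each density costs N(N-1) similarity computations, which motivates the centroid shortcut below:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(docs):
    # Q = 1/(N(N-1)) * sum over ordered pairs i != k of sim(Di, Dk)
    n = len(docs)
    return sum(cos(docs[i], docs[k])
               for i in range(n) for k in range(n) if i != k) / (n * (n - 1))

def dv(docs, j):
    # dv_j = Q - Q_j: density before vs. after assignment of term j
    before = [vec[:j] + [0.0] + vec[j + 1:] for vec in docs]
    return density(before) - density(docs)

docs = [[1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
print(dv(docs, 0))  # < 0: a term assigned to every document is a poor term
print(dv(docs, 1))  # > 0: a medium-frequency term is a good discriminator
```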
Variations of Term-Discrimination Value with Document Frequency
Low document frequency: dvj = 0
Medium document frequency: dvj > 0
High document frequency: dvj < 0
TFij x dvj
wij = tfij x dvj
compared with w_{ij} = tf_{ij} \times \log \frac{N}{df_j}
\log \frac{N}{df_j}: decreases steadily with increasing
document frequency
dvj: increases from zero to positive as the document
frequency of the term increases, then decreases sharply
as the document frequency becomes still larger
Document Centroid
Issue: efficiency problem
N(N-1) pairwise similarities
Document centroid C = (c1, c2, c3, ..., ct)
c_j = \sum_{i=1}^{N} w_{ij}
where wij is the weight of the j-th term in document i.
Space density
Q = \frac{1}{N} \sum_{i=1}^{N} sim(C, D_i)
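A sketch of the centroid-based density, replacing the N(N-1) pairwise similarities with N similarities to the centroid (reusing cos from the previous sketch):

```python
def centroid(docs):
    # c_j = sum over all documents of w_ij
    return [sum(column) for column in zip(*docs)]

def density_centroid(docs):
    # Q = 1/N * sum_i sim(C, Di): O(N) instead of O(N^2)
    C = centroid(docs)
    return sum(cos(C, d) for d in docs) / len(docs)

print(density_centroid([[1.0, 1.0, 0.0],
                        [1.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]]))
```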
Probabilistic Term Weighting
Goal
Explicit distinctions between occurrences of
terms in relevant and nonrelevant documents of
a collection
Definition
Given a user query q, and the ideal answer set of the
relevant documents
From decision theory, the best ranking algorithm
for a document D
g(D) = \log \frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)} + \log \frac{\Pr(rel)}{\Pr(nonrel)}
Probabilistic Term Weighting
Pr(rel), Pr(nonrel):
document’s a priori probabilities of relevance and
nonrelevance
Pr(D|rel), Pr(D|nonrel):
occurrence probabilities of document D in the
relevant and nonrelevant document sets
Assumptions
Terms occur independently in documents
\Pr(D \mid rel) = \prod_{i=1}^{t} \Pr(x_i \mid rel)
\Pr(D \mid nonrel) = \prod_{i=1}^{t} \Pr(x_i \mid nonrel)
Derivation Process
g(D) = \log \frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)} + \log \frac{\Pr(rel)}{\Pr(nonrel)}
     = \log \frac{\prod_{i=1}^{t} \Pr(x_i \mid rel)}{\prod_{i=1}^{t} \Pr(x_i \mid nonrel)} + \text{constants}
     = \sum_{i=1}^{t} \log \frac{\Pr(x_i \mid rel)}{\Pr(x_i \mid nonrel)} + \text{constants}
For a specific document D
Given a document D=(d1, d2, …, dt)
t
Pr( xi = di |rel )
g ( D) = ∑ log + constants
i =1 Pr( xi = di |nonrel )
Assume di is either 0 (absent) or 1 (present).
\Pr(x_i = 1 \mid rel) = p_i, \quad \Pr(x_i = 0 \mid rel) = 1 - p_i
\Pr(x_i = 1 \mid nonrel) = q_i, \quad \Pr(x_i = 0 \mid nonrel) = 1 - q_i
\Pr(x_i = d_i \mid rel) = p_i^{d_i} (1 - p_i)^{1 - d_i}
\Pr(x_i = d_i \mid nonrel) = q_i^{d_i} (1 - q_i)^{1 - d_i}

g(D) = \sum_{i=1}^{t} \log \frac{\Pr(x_i = d_i \mid rel)}{\Pr(x_i = d_i \mid nonrel)} + \text{constants}
     = \sum_{i=1}^{t} \log \frac{p_i^{d_i} (1 - p_i)^{1 - d_i}}{q_i^{d_i} (1 - q_i)^{1 - d_i}} + \text{constants}
     = \sum_{i=1}^{t} \log \left[ \left( \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \right)^{d_i} \frac{1 - p_i}{1 - q_i} \right] + \text{constants}
Term Relevance Weight
g(D) = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{constants}

Term relevance weight:
tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
Issue
How to compute pj and qj?
p_j = r_j / R
q_j = (df_j - r_j) / (N - R)
rj: the number of relevant documents that contain term Tj
R: the total number of relevant documents
N: the total number of documents
Estimation of Term-Relevance
The occurrence probability of a term in the nonrelevant
documents, qj, is approximated by its occurrence
probability in the entire document collection:
q_j = df_j / N
The occurrence probabilities of the terms in the small
number of relevant documents are taken to be equal,
using a constant value pj = 0.5 for all j.
Comparison
tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)} = \log \frac{0.5 \times (1 - \frac{df_j}{N})}{\frac{df_j}{N} \times 0.5} = \log \frac{N - df_j}{df_j}

When N is sufficiently large, N - df_j \approx N, so
tr_j = \log \frac{N - df_j}{df_j} \approx \log \frac{N}{df_j} = idf_j
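A quick numeric check of this approximation with made-up values of N and dfj:

```python
import math

N, df = 100_000, 100
tr = math.log((N - df) / df)   # log((N - df_j) / df_j)
idf = math.log(N / df)         # log(N / df_j)
print(tr, idf)                 # 6.9068... vs 6.9077...: nearly identical
```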
Estimation of Term-Relevance
Estimate the number of relevant documents rj in the
collection that contain term Tj as a function of the known
document frequency dfj of the term Tj.
pj = rj / R
qj = (dfj-rj)/(N-R)
R: an estimate of the total number of relevant documents
in the collection.
Summary
Inverse document frequency, idfj
tfij *idfj (TFxIDF)
Term discrimination value, dvj
tfij *dvj
Probabilistic term weighting trj
tfij *trj
All of these weighting schemes are based on global properties of terms in a document collection.