IRT - Query Processing

The document discusses various query operations, focusing on reformulation techniques such as manual and automatic adjustments, relevance feedback, and query expansion. It explains the importance of user feedback in refining search queries to improve retrieval performance and introduces methods like Rocchio's algorithm for modifying queries based on relevant and non-relevant documents. Additionally, it covers local and global analysis for query expansion using thesauri and statistical methods to enhance search results.


Query Operations

Part One
Dr Ali Al-Ibrahim
Reformulation of Query

Manual
• Add or remove search terms
• Change Boolean operators

Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms

2
Relevance Feedback:

Document Vectors as Points on a Surface


• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface
with unit radius
• For similar documents, we can represent parts of
this surface as a flat region
• Similar documents are represented as points that are
close together on this surface

3
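As a minimal illustration of the normalization step above, the sketch below (Python with NumPy, not from the slides) scales each document vector to unit length so that all vector endpoints lie on the unit sphere and cosine similarity reduces to a dot product.

```python
import numpy as np

def normalize_rows(doc_vectors: np.ndarray) -> np.ndarray:
    """Scale each row (document vector) to unit length."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors unchanged
    return doc_vectors / norms

# Example: three documents over a four-term vocabulary
docs = np.array([[2.0, 0.0, 1.0, 0.0],
                 [0.0, 3.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0]])
unit_docs = normalize_rows(docs)
print(np.linalg.norm(unit_docs, axis=1))  # each length is 1.0
```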
Results of a Search

[Figure: hits from the search (marked x) clustered around the query (marked ∆); x = documents found by search, ∆ = query]
4
Relevance Feedback (Concept)

[Figure: results of the original search; x = documents identified as non-relevant, o = documents identified as relevant, ∆ = original query; the reformulated query shifts toward the relevant documents]

5
Theoretically Best Query
[Figure: the theoretically optimal query positioned among the documents; x = non-relevant documents, o = relevant documents]
6
Theoretically Best Query

For a specific query, Q, let:


DR be the set of all relevant documents
DN-R be the set of all non-relevant documents
sim (Q, DR) be the mean similarity between query Q and
documents in DR
sim (Q, DN-R) be the mean similarity between query Q and
documents in DN-R
The theoretically best query would maximize:
F = sim(Q, DR) - sim(Q, DN-R)
7
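A small sketch of this criterion, assuming cosine similarity over unit-normalized vectors (the function names are illustrative, not from the slides):

```python
import numpy as np

def mean_sim(query: np.ndarray, docs: np.ndarray) -> float:
    """Mean cosine similarity between a unit-length query and unit-length document rows."""
    return float(np.mean(docs @ query))

def feedback_criterion(query, relevant_docs, nonrelevant_docs) -> float:
    """F = sim(Q, DR) - sim(Q, DN-R); the theoretically best query maximizes F."""
    return mean_sim(query, relevant_docs) - mean_sim(query, nonrelevant_docs)
```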
Estimating the Best Query

In practice, DR and DN-R are not known. (The objective is to find them.)
However, the results of an initial query can be used to estimate sim(Q, DR) and sim(Q, DN-R).

8
Rocchio's Modified Query

Modified query vector
= Original query vector
+ Mean of relevant documents found by original query
- Mean of non-relevant documents found by original query

9
Query Modification

Q1 = Q0 + (1/n1) Σ_{i=1}^{n1} Ri - (1/n2) Σ_{i=1}^{n2} Si

Q0 = vector for the initial query


Q1 = vector for the modified query
Ri = vector for relevant document i
Si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
Rocchio 1971
10
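A minimal sketch of this modification in Python/NumPy. The function name rocchio and the optional alpha/beta/gamma parameters (defaulting to 1, which matches the unweighted formula above and anticipates the weighted form on slide 14) are illustrative choices, not taken from the slides.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Q1 = alpha*Q0 + beta*(mean of relevant docs) - gamma*(mean of non-relevant docs)."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if len(relevant) > 0:
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        q1 = q1 - gamma * np.mean(nonrelevant, axis=0)
    return q1
```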
Difficulties with Relevance Feedback
[Figure: the optimal query shown among the documents; hits from the initial query are contained in the gray shaded area; x = non-relevant documents, o = relevant documents, ∆ = original query; the reformulated query is also shown]
11
Difficulties with Relevance Feedback
[Figure: the same document scatter, asking what region provides the optimal results set; x = non-relevant documents, o = relevant documents, ∆ = original query; the reformulated query is also shown]
12
When to Use Relevance Feedback

Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents.
Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauri and other manual enhancements
13
Adjusting Parameters 1: Relevance Feedback

Q1 = α Q0 + (β/n1) Σ_{i=1}^{n1} Ri - (γ/n2) Σ_{i=1}^{n2} Si

α, β, and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback only, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback only, by reducing the weight given to terms from the non-relevant documents in the initial set.
14
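Reusing the hypothetical rocchio sketch from slide 10, the two special cases read as follows (the weight values 0.75 and 0.25 are illustrative, and q0, relevant, and nonrelevant are assumed to be defined as in that sketch):

```python
# Positive feedback only (gamma = 0): emphasize the relevant documents
q_pos = rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.0)

# Negative feedback only (beta = 0): move away from the non-relevant documents
q_neg = rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.0, gamma=0.25)
```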
Adjusting Parameters 2: Filtering Incoming Messages

D1, D2, D3, ... is a stream of incoming documents that are to be divided into two sets:
R - documents judged relevant to an information need
S - documents judged not relevant to the information need
A query is defined as a vector in the term vector space:
Q = (w1, w2, ..., wn)
where wi is the weight given to term i
Dj will be assigned to R if similarity(Q, Dj) > λ

15
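A small sketch of this filtering rule, assuming unit-normalized vectors so that similarity is a dot product (the threshold value and names are illustrative):

```python
import numpy as np

def route_document(query, doc, lam):
    """Assign an incoming document to R if similarity(Q, Dj) > lambda, otherwise to S."""
    similarity = float(np.dot(query, doc))  # cosine similarity for unit-length vectors
    return "R" if similarity > lam else "S"

# Stream processing: each arriving document is routed as it comes in
# for doc in incoming_documents:           # hypothetical document stream
#     bucket = route_document(Q, doc, lam=0.3)
```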
Query Operations
Chapter 5
Part Two
Operations
 Introduction
 Relevance Feedback
 Query Expansion
 Term Reweighting
 Automatic Local Analysis
 Query Expansion using Clustering
 Automatic Global Analysis
 Query Expansion using Thesaurus
 Similarity Thesaurus
 Statistical Thesaurus

17
Query Operations Introduction

 IR queries as stated by the user may not be precise or effective.
 There are many techniques to improve a stated query and then process that query instead.

18
Relevance Feedback

 Use assessments by users as to the relevance of previously returned documents to create new (or modify old) queries.
 Technique:
1. Increase weights of terms from relevant documents.
2. Decrease weights of terms from non-relevant documents.

19
Relevance Feedback
 After initial retrieval results are
presented, allow the user to provide
feedback on the relevance of one or
more of the retrieved documents.
 Use this feedback information to
reformulate the query.
 Produce new results based on the reformulated query.
 Allows a more interactive retrieval process.

20
Relevance Feedback Architecture
[Diagram: the query string is submitted to the IR system, which searches the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks retrieved documents as relevant (⇑) or non-relevant (⇓); this feedback drives query reformulation, and the revised query is rerun to produce reranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
21
Query Reformulation

 Revise query to account for feedback:
 Query Expansion: Add new terms to the query from relevant documents.
 Term Reweighting: Increase weight of terms in relevant documents and decrease weight of terms in irrelevant documents.
 Several algorithms exist for query reformulation.

22
Query Reformulation for VSR

 Change the query vector using vector algebra.
 Add the vectors for the relevant documents to the query vector.
 Subtract the vectors for the irrelevant documents from the query vector.
 This adds both positively and negatively weighted terms to the query, as well as reweighting the initial terms.

23
Optimal Query
 Assume that the set of relevant documents Cr is known.
 Then the best query that ranks all and only the relevant documents at the top is:

q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj - (1/(N - |Cr|)) Σ_{dj ∉ Cr} dj

where N is the total number of documents.


24
Standard Rocchio Method
 Since the set of all relevant documents is unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of documents and include the initial query q.

q_m = α q + (β/|Dr|) Σ_{dj ∈ Dr} dj - (γ/|Dn|) Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
25
26
Ide Regular Method
 Since more feedback should perhaps
increase the degree of reformulation, do
not normalize for amount of feedback:
q_m = α q + β Σ_{dj ∈ Dr} dj - γ Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.


β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
27
Ide “Dec Hi” Method
 Bias towards rejecting just the highest
ranked of the irrelevant documents:
q_m = α q + β Σ_{dj ∈ Dr} dj - γ max_non-relevant(dj)

α: Tunable weight for initial query.


β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant document.

28
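A hedged sketch of the Ide "Dec Hi" update, where only the highest-ranked non-relevant document is subtracted (NumPy; the argument names are illustrative, not from the slides):

```python
import numpy as np

def ide_dec_hi(q0, relevant, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(relevant docs) - gamma*(highest-ranked non-relevant doc)."""
    q_m = alpha * np.asarray(q0, dtype=float)
    for d in relevant:                      # Ide methods use an unnormalized sum
        q_m = q_m + beta * np.asarray(d, dtype=float)
    q_m = q_m - gamma * np.asarray(top_nonrelevant, dtype=float)
    return q_m
```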
Comparison of Methods

 Overall, experimental results indicate no clear preference for any one of the specific methods.
 All methods generally improve retrieval performance (recall & precision) with feedback.
 Generally just let the tunable constants equal 1.

29
Why is Feedback Not Widely Used?

 Users are sometimes reluctant to provide explicit feedback.
 Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one.
 Makes it harder to understand why a particular document was retrieved.

30
Pseudo Feedback

 Use relevance feedback methods without explicit user input.
 Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
 Allows for query expansion that includes terms that are correlated with the query terms.

31
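A sketch of pseudo (blind) relevance feedback built on the hypothetical rocchio function from the earlier sketch; the search callable and doc_vectors mapping stand in for whatever ranked-retrieval interface the system provides (assumptions, not slide details):

```python
def pseudo_feedback(q0, search, doc_vectors, m=10, beta=0.75):
    """Assume the top-m retrieved documents are relevant and reformulate the query with them."""
    top_ids = search(q0)[:m]                      # ids of the top-m ranked documents
    top_vecs = [doc_vectors[i] for i in top_ids]
    # No explicit non-relevant judgments are available, so apply positive feedback only.
    return rocchio(q0, relevant=top_vecs, nonrelevant=[], beta=beta, gamma=0.0)
```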
Pseudo Feedback Results

 Found to improve performance on the TREC ad-hoc retrieval task.
 Works even better if the top documents must also satisfy additional Boolean constraints in order to be used in feedback.

32
33
Local vs. Global Automatic Analysis

 Local – Documents retrieved are examined to automatically determine query expansion. No relevance feedback needed.
 Global – A thesaurus is used to help select terms for expansion.

34
Thesaurus
A thesaurus provides information on synonyms and semantically related words and phrases.
 Example:

physician
syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||sawbones
rel: medic, general practitioner, surgeon,

35
Thesaurus-based Query Expansion

 For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus.
 May weight added terms less than original query terms.
 Generally increases recall.
 May significantly decrease precision, particularly with ambiguous terms.
 “interest rate” → “interest rate fascinate evaluate”

36
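A minimal sketch of thesaurus-based expansion over a weighted query. The tiny thesaurus dictionary and the 0.5 down-weighting factor are illustrative assumptions, not part of the slides:

```python
# Hypothetical thesaurus: term -> synonyms / related words
thesaurus = {
    "physician": ["doctor", "medic", "surgeon"],
    "rate": ["evaluate"],
}

def expand_query(query_weights, related_weight=0.5):
    """Add thesaurus terms for each query term, weighted lower than the originals."""
    expanded = dict(query_weights)
    for term, weight in query_weights.items():
        for related in thesaurus.get(term, []):
            # Keep the larger weight if the related term is already in the query.
            expanded[related] = max(expanded.get(related, 0.0), related_weight * weight)
    return expanded

print(expand_query({"interest": 1.0, "rate": 1.0}))
# {'interest': 1.0, 'rate': 1.0, 'evaluate': 0.5}
```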
Statistical Thesaurus

 Existing human-developed thesauri are not easily available in all languages.
 Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
 Semantically related terms can be discovered from statistical analysis of corpora.

37
Query Expansion Based on a Statistical Thesaurus

 A global thesaurus is composed of classes which group correlated terms in the context of the whole collection.
 Such correlated terms can then be used to expand the original user query.
 These terms must be low-frequency terms.
 However, it is difficult to cluster low-frequency terms.
 To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
 This algorithm must produce small and tight clusters.

38
Automatic Local Analysis

 At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents.
 Base correlation analysis on only the “local” set of retrieved documents for a specific query.
 Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
 “Apple computer” → “Apple computer Powerbook laptop”

39
Automatic Local Analysis

 Expand the query with terms found in local clusters.
 Dl – set of documents retrieved for query q.
 Vl – set of words used in Dl.
 Sl – set of distinct stems in Vl.
 f_si,j – frequency of stem si in document dj found in Dl.
 Construct a stem-stem association matrix.

40
Association Matrix
      w1    w2    w3   ...   wn
w1    c11   c12   c13  ...   c1n
w2    c21
w3    c31
...
wn    cn1

cij: correlation factor between stem si and stem sj

c_ij = Σ_{dk ∈ Dl} f_si,k × f_sj,k

f_si,k: frequency of stem si in document dk

41
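A sketch of building the (unnormalized) stem-stem association matrix from the local document set, assuming each document has already been reduced to a list of stems (the names are illustrative):

```python
from collections import Counter
from itertools import product
import numpy as np

def association_matrix(local_docs, stems):
    """c[i, j] = sum over documents in Dl of f(stem_i, doc) * f(stem_j, doc)."""
    index = {s: i for i, s in enumerate(stems)}
    c = np.zeros((len(stems), len(stems)))
    for doc in local_docs:                         # each doc is a list of stems
        freqs = Counter(s for s in doc if s in index)
        for (si, fi), (sj, fj) in product(freqs.items(), repeat=2):
            c[index[si], index[sj]] += fi * fj
    return c
```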
Problems with Local Analysis

 Term ambiguity may introduce irrelevant statistically correlated terms.
 “Apple computer” → “Apple red fruit computer”
 Since terms are highly correlated anyway, expansion may not retrieve many additional documents.

42
Automatic Global Analysis

 Determine term similarity through a pre-computed statistical analysis of the complete corpus.
 Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
 Expand queries with the statistically most similar terms.

43
Automatic Global Analysis

 There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
 Query Expansion based on a Similarity Thesaurus
 Query Expansion based on a Statistical Thesaurus

44
Global vs. Local Analysis

 Global analysis requires intensive term correlation computation only once, at system development time.
 Local analysis requires intensive term correlation computation for every query at run time (although the number of terms and documents is less than in global analysis).
 But local analysis gives better results.

45
Query Expansion Conclusions

 Expansion of queries with related terms can improve performance, particularly recall.
 However, similar terms must be selected very carefully to avoid problems such as loss of precision.

46
Conclusion

 A thesaurus is an efficient method to expand queries.
 The computation is expensive, but it is executed only once.
 Query expansion based on a similarity thesaurus may use high term frequency to expand the query.
 Query expansion based on a statistical thesaurus needs well-defined parameters.
47
