IRT - Query Processing

The document discusses various query operations, focusing on reformulation techniques such as manual and automatic adjustments, relevance feedback, and query expansion. It explains the importance of user feedback in refining search queries to improve retrieval performance and introduces methods like Rocchio's algorithm for modifying queries based on relevant and non-relevant documents. Additionally, it covers local and global analysis for query expansion using thesauri and statistical methods to enhance search results.


Query Operations

Part One
Dr Ali Al-Ibrahim
Reformulation of Query

Manual
• Add or remove search terms
• Change Boolean operators

Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms

2
Relevance Feedback:

Document Vectors as Points on a Surface


• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface
with unit radius
• For similar documents, we can represent parts of
this surface as a flat region
• Similar documents are represented as points that are
close together on this surface

3
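As a minimal illustration of the normalization step above, the sketch below (Python with NumPy, not from the slides) scales each document vector to unit length so that all vector endpoints lie on the unit sphere and cosine similarity reduces to a dot product.

```python
import numpy as np

def normalize_rows(doc_vectors: np.ndarray) -> np.ndarray:
    """Scale each row (document vector) to unit length."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors unchanged
    return doc_vectors / norms

# Example: three documents over a four-term vocabulary
docs = np.array([[2.0, 0.0, 1.0, 0.0],
                 [0.0, 3.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0]])
unit_docs = normalize_rows(docs)
print(np.linalg.norm(unit_docs, axis=1))  # each length is 1.0
```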
Results of a Search

[Figure: hits from the search (marked x) clustered around the query (marked ∆); x = documents found by search, ∆ = query]
4
Relevance Feedback (Concept)

[Figure: results of the original search; x = documents identified as non-relevant, o = documents identified as relevant, ∆ = original query; the reformulated query shifts toward the relevant documents]

5
Theoretically Best Query
[Figure: the theoretically optimal query positioned among the documents; x = non-relevant documents, o = relevant documents]
6
Theoretically Best Query

For a specific query, Q, let:


DR be the set of all relevant documents
DN-R be the set of all non-relevant documents
sim (Q, DR) be the mean similarity between query Q and
documents in DR
sim (Q, DN-R) be the mean similarity between query Q and
documents in DN-R
The theoretically best query would maximize:
F = sim(Q, DR) - sim(Q, DN-R)
7
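A small sketch of this criterion, assuming cosine similarity over unit-normalized vectors (the function names are illustrative, not from the slides):

```python
import numpy as np

def mean_sim(query: np.ndarray, docs: np.ndarray) -> float:
    """Mean cosine similarity between a unit-length query and unit-length document rows."""
    return float(np.mean(docs @ query))

def feedback_criterion(query, relevant_docs, nonrelevant_docs) -> float:
    """F = sim(Q, DR) - sim(Q, DN-R); the theoretically best query maximizes F."""
    return mean_sim(query, relevant_docs) - mean_sim(query, nonrelevant_docs)
```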
Estimating the Best Query

In practice, DR and DN-R are not known. (The objective is to find them.)
However, the results of an initial query can be used to estimate sim(Q, DR) and sim(Q, DN-R).

8
Rocchio's Modified Query

Modified query vector
= Original query vector
+ Mean of relevant documents found by original query
- Mean of non-relevant documents found by original query

9
Query Modification

Q1 = Q0 + (1/n1) Σ_{i=1}^{n1} Ri - (1/n2) Σ_{i=1}^{n2} Si

Q0 = vector for the initial query


Q1 = vector for the modified query
Ri = vector for relevant document i
Si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
Rocchio 1971
10
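A minimal sketch of this modification in Python/NumPy. The function name rocchio and the optional alpha/beta/gamma parameters (defaulting to 1, which matches the unweighted formula above and anticipates the weighted form on slide 14) are illustrative choices, not taken from the slides.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Q1 = alpha*Q0 + beta*(mean of relevant docs) - gamma*(mean of non-relevant docs)."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if len(relevant) > 0:
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        q1 = q1 - gamma * np.mean(nonrelevant, axis=0)
    return q1
```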
Difficulties with Relevance Feedback
[Figure: the optimal query shown among the documents; hits from the initial query are contained in the gray shaded area; x = non-relevant documents, o = relevant documents, ∆ = original query; the reformulated query is also shown]
11
Difficulties with Relevance Feedback
[Figure: the same document scatter, asking what region provides the optimal results set; x = non-relevant documents, o = relevant documents, ∆ = original query; the reformulated query is also shown]
12
When to Use Relevance Feedback

Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents.
Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauri and other manual enhancements
13
Adjusting Parameters 1: Relevance Feedback

Q1 = α Q0 + (β/n1) Σ_{i=1}^{n1} Ri - (γ/n2) Σ_{i=1}^{n2} Si

α, β, and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback only, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback only, by reducing the weight given to terms from the non-relevant documents in the initial set.
14
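Reusing the hypothetical rocchio sketch from slide 10, the two special cases read as follows (the weight values 0.75 and 0.25 are illustrative, and q0, relevant, and nonrelevant are assumed to be defined as in that sketch):

```python
# Positive feedback only (gamma = 0): emphasize the relevant documents
q_pos = rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.0)

# Negative feedback only (beta = 0): move away from the non-relevant documents
q_neg = rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.0, gamma=0.25)
```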
Adjusting Parameters 2: Filtering Incoming Messages

D1, D2, D3, ... is a stream of incoming documents that are to be divided into two sets:
R - documents judged relevant to an information need
S - documents judged not relevant to the information need
A query is defined as a vector in the term vector space:
Q = (w1, w2, ..., wn)
where wi is the weight given to term i
Dj will be assigned to R if similarity(Q, Dj) > λ

15
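A small sketch of this filtering rule, assuming unit-normalized vectors so that similarity is a dot product (the threshold value and names are illustrative):

```python
import numpy as np

def route_document(query, doc, lam):
    """Assign an incoming document to R if similarity(Q, Dj) > lambda, otherwise to S."""
    similarity = float(np.dot(query, doc))  # cosine similarity for unit-length vectors
    return "R" if similarity > lam else "S"

# Stream processing: each arriving document is routed as it comes in
# for doc in incoming_documents:           # hypothetical document stream
#     bucket = route_document(Q, doc, lam=0.3)
```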
Query Operations
Chapter 5
Part Two
Operations
 Introduction
 Relevance Feedback
 Query Expansion
 Term Reweighting
 Automatic Local Analysis
 Query Expansion using Clustering
 Automatic Global Analysis
 Query Expansion using Thesaurus
 Similarity Thesaurus
 Statistical Thesaurus

17
Query Operations Introduction

 IR queries as stated by the user may not be precise or effective.
 There are many techniques to improve a stated query and then process that query instead.

18
Relevance Feedback

 Use assessments by users as to the relevance of previously returned documents to create new (or modify old) queries.
 Technique:
1. Increase weights of terms from relevant documents.
2. Decrease weights of terms from non-relevant documents.

19
Relevance Feedback
 After initial retrieval results are
presented, allow the user to provide
feedback on the relevance of one or
more of the retrieved documents.
 Use this feedback information to
reformulate the query.
 Produce new results based on the reformulated query.
 Allows a more interactive retrieval process.

20
Relevance Feedback Architecture
[Diagram: the query string is submitted to the IR system, which searches the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks retrieved documents as relevant (⇑) or non-relevant (⇓); this feedback drives query reformulation, and the revised query is rerun to produce reranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
21
Query Reformulation

 Revise query to account for feedback:
 Query Expansion: Add new terms to the query from relevant documents.
 Term Reweighting: Increase weight of terms in relevant documents and decrease weight of terms in irrelevant documents.
 Several algorithms exist for query reformulation.

22
Query Reformulation for VSR

 Change the query vector using vector algebra.
 Add the vectors for the relevant documents to the query vector.
 Subtract the vectors for the irrelevant documents from the query vector.
 This adds both positively and negatively weighted terms to the query, as well as reweighting the initial terms.

23
Optimal Query
 Assume that the set of relevant documents Cr is known.
 Then the best query that ranks all and only the relevant documents at the top is:

q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj - (1/(N - |Cr|)) Σ_{dj ∉ Cr} dj

where N is the total number of documents.


24
Standard Rocchio Method
 Since the set of all relevant documents is unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of documents and include the initial query q.

q_m = α q + (β/|Dr|) Σ_{dj ∈ Dr} dj - (γ/|Dn|) Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
25
26
Ide Regular Method
 Since more feedback should perhaps
increase the degree of reformulation, do
not normalize for amount of feedback:
q_m = α q + β Σ_{dj ∈ Dr} dj - γ Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.


β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
27
Ide “Dec Hi” Method
 Bias towards rejecting just the highest
ranked of the irrelevant documents:
q_m = α q + β Σ_{dj ∈ Dr} dj - γ max_non-relevant(dj)

α: Tunable weight for initial query.


β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant document.

28
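A hedged sketch of the Ide "Dec Hi" update, where only the highest-ranked non-relevant document is subtracted (NumPy; the argument names are illustrative, not from the slides):

```python
import numpy as np

def ide_dec_hi(q0, relevant, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(relevant docs) - gamma*(highest-ranked non-relevant doc)."""
    q_m = alpha * np.asarray(q0, dtype=float)
    for d in relevant:                      # Ide methods use an unnormalized sum
        q_m = q_m + beta * np.asarray(d, dtype=float)
    q_m = q_m - gamma * np.asarray(top_nonrelevant, dtype=float)
    return q_m
```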
Comparison of Methods

 Overall, experimental results indicate no clear preference for any one of the specific methods.
 All methods generally improve retrieval performance (recall & precision) with feedback.
 Generally just let the tunable constants equal 1.

29
Why is Feedback Not Widely Used?

 Users are sometimes reluctant to provide explicit feedback.
 Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one.
 Makes it harder to understand why a particular document was retrieved.

30
Pseudo Feedback

 Use relevance feedback methods without explicit user input.
 Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
 Allows for query expansion that includes terms that are correlated with the query terms.

31
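A sketch of pseudo (blind) relevance feedback built on the hypothetical rocchio function from the earlier sketch; the search callable and doc_vectors mapping stand in for whatever ranked-retrieval interface the system provides (assumptions, not slide details):

```python
def pseudo_feedback(q0, search, doc_vectors, m=10, beta=0.75):
    """Assume the top-m retrieved documents are relevant and reformulate the query with them."""
    top_ids = search(q0)[:m]                      # ids of the top-m ranked documents
    top_vecs = [doc_vectors[i] for i in top_ids]
    # No explicit non-relevant judgments are available, so apply positive feedback only.
    return rocchio(q0, relevant=top_vecs, nonrelevant=[], beta=beta, gamma=0.0)
```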
Pseudo Feedback Results

 Found to improve performance on the TREC ad-hoc retrieval task.
 Works even better if the top documents must also satisfy additional Boolean constraints in order to be used in feedback.

32
33
Local vs. Global Automatic Analysis

 Local – Documents retrieved are examined to automatically determine query expansion. No relevance feedback needed.
 Global – A thesaurus is used to help select terms for expansion.

34
Thesaurus
A thesaurus provides information on synonyms and semantically related words and phrases.
 Example:

physician
syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||sawbones
rel: medic, general practitioner, surgeon,

35
Thesaurus-based Query Expansion

 For each term, t, in a query, expand the query with synonyms and related words of t from the thesaurus.
 May weight added terms less than original query terms.
 Generally increases recall.
 May significantly decrease precision, particularly with ambiguous terms.
 “interest rate” → “interest rate fascinate evaluate”

36
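A minimal sketch of thesaurus-based expansion over a weighted query. The tiny thesaurus dictionary and the 0.5 down-weighting factor are illustrative assumptions, not part of the slides:

```python
# Hypothetical thesaurus: term -> synonyms / related words
thesaurus = {
    "physician": ["doctor", "medic", "surgeon"],
    "rate": ["evaluate"],
}

def expand_query(query_weights, related_weight=0.5):
    """Add thesaurus terms for each query term, weighted lower than the originals."""
    expanded = dict(query_weights)
    for term, weight in query_weights.items():
        for related in thesaurus.get(term, []):
            # Keep the larger weight if the related term is already in the query.
            expanded[related] = max(expanded.get(related, 0.0), related_weight * weight)
    return expanded

print(expand_query({"interest": 1.0, "rate": 1.0}))
# {'interest': 1.0, 'rate': 1.0, 'evaluate': 0.5}
```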
Statistical Thesaurus

 Existing human-developed thesauri are not easily available in all languages.
 Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
 Semantically related terms can be discovered from statistical analysis of corpora.

37
Query Expansion Based on a Statistical Thesaurus

 A global thesaurus is composed of classes which group correlated terms in the context of the whole collection.
 Such correlated terms can then be used to expand the original user query.
 These terms must be low-frequency terms.
 However, it is difficult to cluster low-frequency terms.
 To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
 This algorithm must produce small and tight clusters.

38
Automatic Local Analysis

 At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents.
 Base correlation analysis on only the “local” set of retrieved documents for a specific query.
 Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
 “Apple computer” → “Apple computer Powerbook laptop”

39
Automatic Local Analysis

 Expand the query with terms found in local clusters.
 Dl – set of documents retrieved for query q.
 Vl – set of words used in Dl.
 Sl – set of distinct stems in Vl.
 f_si,j – frequency of stem si in document dj found in Dl.
 Construct a stem-stem association matrix.

40
Association Matrix
      w1    w2    w3   ...   wn
w1    c11   c12   c13  ...   c1n
w2    c21
w3    c31
...
wn    cn1

cij: correlation factor between stem si and stem sj

c_ij = Σ_{dk ∈ Dl} f_si,k × f_sj,k

f_si,k: frequency of stem si in document dk

41
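A sketch of building the (unnormalized) stem-stem association matrix from the local document set, assuming each document has already been reduced to a list of stems (the names are illustrative):

```python
from collections import Counter
from itertools import product
import numpy as np

def association_matrix(local_docs, stems):
    """c[i, j] = sum over documents in Dl of f(stem_i, doc) * f(stem_j, doc)."""
    index = {s: i for i, s in enumerate(stems)}
    c = np.zeros((len(stems), len(stems)))
    for doc in local_docs:                         # each doc is a list of stems
        freqs = Counter(s for s in doc if s in index)
        for (si, fi), (sj, fj) in product(freqs.items(), repeat=2):
            c[index[si], index[sj]] += fi * fj
    return c
```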
Problems with Local Analysis

 Term ambiguity may introduce irrelevant statistically correlated terms.
 “Apple computer” → “Apple red fruit computer”
 Since terms are highly correlated anyway, expansion may not retrieve many additional documents.

42
Automatic Global Analysis

 Determine term similarity through a pre-computed statistical analysis of the complete corpus.
 Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
 Expand queries with the statistically most similar terms.

43
Automatic Global Analysis

 There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
 Query Expansion based on a Similarity Thesaurus
 Query Expansion based on a Statistical Thesaurus

44
Global vs. Local Analysis

 Global analysis requires intensive term correlation computation only once, at system development time.
 Local analysis requires intensive term correlation computation for every query at run time (although the number of terms and documents is less than in global analysis).
 But local analysis gives better results.

45
Query Expansion Conclusions

 Expansion of queries with related terms can improve performance, particularly recall.
 However, similar terms must be selected very carefully to avoid problems such as loss of precision.

46
Conclusion

 A thesaurus is an efficient method to expand queries.
 The computation is expensive, but it is executed only once.
 Query expansion based on a similarity thesaurus may use high term frequency to expand the query.
 Query expansion based on a statistical thesaurus needs well-defined parameters.
47
