
Text Search for Fine-grained Semi-structured Data

Soumen Chakrabarti
Indian Institute of Technology, Bombay
www.cse.iitb.ac.in/~soumen/

Acknowledgments
S. Sudarshan Arvind Hulgeri
B. Aditya Parag

Two extreme search paradigms

Searching a RDBMS
Complex data model: tables, rows, columns, data types
Expressive, powerful query language
Need to know schema to query
Answer = unordered set of rows
Ranking: afterthought

Information Retrieval
Collection = set of documents, document = sequence of terms
Terms and phrases present or absent
No (nontrivial) schema to learn
Answer = sequence of documents
Ranking: central to IR

Convergence?
SQL → XML search
Trees, reference links
Labeled edges
Nodes may contain
  Structured data
  Free text fields
Data vs. document
Query involves node data and edge labels
Partial knowledge of schema ok
Answer = set of paths

Web search ← IR
Documents are nodes in a graph
Hyperlink edges have important but unspecified semantics
Google, HITS
Query language remains primitive
No data types
No use of tag-tree
Answer = URL list

Outline of this tutorial

Review of text indexing and information retrieval (IR)
Support for text search and similarity join in relational databases with text columns
Text search features in major XML query languages (and what’s missing)
A graph model for semi-structured data with “free-form” text in nodes
Proximity search formulations and techniques; how to rank responses
Folding in user feedback
Trends and research problems

Text indexing basics
“Inverted index” maps from term to document IDs
Term offset info enables phrase and proximity (“near”) searches
Document boundary and limitations of “near” queries
Can extend inverted index to map terms to
  Table names, column names
  Primary keys, RIDs
  XML DOM node IDs
[Figure: two example documents and their postings with zero-based term offsets]
D1: My care is loss of care with old care done
D2: Your care is gain of care with new care won
care → D1: 1, 5, 8; D2: 1, 5, 8
new → D2: 7
old → D1: 7
loss → D1: 3
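The positional inverted index described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_index`, `phrase` and `near` are hypothetical, tokenization is plain whitespace splitting, and the two example documents are chosen so that the postings match those on the slide (for instance care → D1: 1, 5, 8).

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [zero-based word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def phrase(index, terms):
    """Documents in which `terms` occur at consecutive offsets."""
    hits = set()
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + k in index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms[1:], 1)):
                hits.add(doc_id)
                break
    return hits


def near(index, t1, t2, window):
    """Documents in which t1 and t2 occur within `window` offsets of each other."""
    common = index.get(t1, {}).keys() & index.get(t2, {}).keys()
    return {d for d in common
            if any(abs(i - j) <= window
                   for i in index[t1][d] for j in index[t2][d])}


# Illustrative documents (an assumption, see the lead-in).
docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}
index = build_index(docs)
print(index["care"])                    # offsets of "care" in each document
print(phrase(index, ["old", "care"]))   # phrase search uses adjacent offsets
print(near(index, "gain", "new", 4))    # proximity search uses an offset window
```

Because offsets are stored per document, a "near" match can never span two documents, which is the document-boundary limitation the slide mentions. Mapping terms to RIDs or DOM node IDs instead of document IDs would only change what `doc_id` stands for.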

Information retrieval basics

Stopwords and stemming
Each term t in lexicon gets a dimension in vector space
Documents and the query are vectors in term space
Component of d along axis t is TF(d,t)
  Absolute term count or scaled by max term count
Downplay frequent terms: $\mathrm{IDF}(t) = \log(1 + |D|/|D_t|)$
Better model: document vector d has component TF(d,t) IDF(t) for term t
Query is like another “document”; documents ranked by cosine similarity with query
[Figure: document and query vectors in term space, with axes marked “scale up” and “scale down”]
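As a rough illustration of the vector-space model above, the following Python sketch ranks documents by cosine similarity of TF-IDF vectors. It assumes TF scaled by the maximum term count and IDF(t) = log(1 + |D|/|D_t|) as on the slide; the function names, whitespace tokenization and the absence of stopword removal and stemming are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({doc_id: {term: TF*IDF}}, {term: IDF}) for non-empty documents."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # |D_t|: number of documents that contain term t
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(1 + n_docs / df[t]) for t in df}
    vectors = {}
    for d, toks in tokenized.items():
        counts = Counter(toks)
        max_count = max(counts.values())
        vectors[d] = {t: (c / max_count) * idf[t] for t, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Treat the query as another document and sort documents by cosine score."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(t for t in query.lower().split() if t in idf)
    if not q_counts:
        return []
    q_max = max(q_counts.values())
    q_vec = {t: (c / q_max) * idf[t] for t, c in q_counts.items()}
    return sorted(((cosine(q_vec, vec), d) for d, vec in vectors.items()),
                  reverse=True)
```

For example, `rank("linux modem", {"d1": "modem driver linux", "d2": "linux kernel", "d3": "cooking recipe"})` puts `d1` first and gives `d3` a score of zero, since it shares no term with the query.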

Map

IR support   Relational data model        XML-like data model
None         SQL, Datalog                 XML-QL, XQuery
Schema       WHIRL                        ELIXIR, XIRQL
No schema    DBXplorer, BANKS, DISCOVER   EasyAsk, Mercado, DataSpot, BANKS

“None” = nothing more than string equality, containment (substring), and perhaps lexicographic ordering
“Schema”: Extensions to query languages, user needs to know data schema, IR-like ranking schemes, no implicit joins
“No schema”: Keyword queries, implicit joins

WHIRL (Cohen 1998)

place(univ,state) and job(univ,dept)
Ranked retrieval from a RDBMS:
select univ from job where dept ~ 'Civil'
Ranked similarity join on text columns:
select state, dept from place, job
where place.univ ~ job.univ
Limit answer to best k matches only
Avoid evaluating full Cartesian product
“Iceberg” query
Useful for data cleaning and integration

WHIRL scoring function
A where-clause in WHIRL is a
  Boolean predicate as in SQL (age=35)
    Scores for such clauses are 0/1
  Similarity predicate (job ~ 'Web design')
    Score = cosine(job, 'Web design')
  Conjunction or disjunction of clauses
    Sub-clause scores interpreted as probabilities
    $\mathrm{score}(B_1 \wedge \dots \wedge B_k; \theta) = \prod_{1 \le i \le k} \mathrm{score}(B_i, \theta)$
    $\mathrm{score}(B_1 \vee \dots \vee B_k; \theta) = 1 - \prod_{1 \le i \le k} (1 - \mathrm{score}(B_i, \theta))$
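The two combination rules can be written down directly. The sketch below is not WHIRL's implementation; the function names are hypothetical, and it only shows how sub-clause scores in [0,1] combine as independent probabilities.

```python
import math


def score_conjunction(scores):
    """score(B1 and ... and Bk) = product of the sub-clause scores."""
    return math.prod(scores)


def score_disjunction(scores):
    """score(B1 or ... or Bk) = 1 - product of (1 - sub-clause score)."""
    return 1 - math.prod(1 - s for s in scores)


# A Boolean predicate contributes 0 or 1, a similarity predicate a cosine score.
print(score_conjunction([1.0, 0.5]))
print(score_disjunction([0.5, 0.5]))
```

A failed Boolean predicate (score 0) zeroes out a conjunction, while a satisfied one (score 1) leaves the similarity score unchanged, which matches the usual SQL filtering behavior.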

Query execution strategy

select state, dept from place, job
where place.univ ~ job.univ
Start with place(U1,S) and job(U2,D)
where U1, U2, S and D are “free”
Any binding of these variables to constants is
associated with a score
Greedily extend the current bindings for
maximum gain in score
Backtrack to find more solutions

XQuery
Quilt + Lorel + YATL + XML-QL
Path expressions
<dishes_with_flour> { FOR $r IN
document("recipes.xml")
//recipe[.//ingredient[@name="flour"]]
RETURN <dish>{$r/title/text()}</dish> }
</dishes_with_flour>
[Figure: tree for recipes.xml in which $r binds to a recipe node with title “Tortilla” and an ingredient whose name is “flour”]

Early text support in XQuery

Title of books containing some para mentioning
both “sailing” and “windsurfing”
FOR $b IN document("bib.xml")//book
WHERE SOME $p IN $b//paragraph SATISFIES
(contains($p,"sailing") AND
contains($p,"windsurfing"))
RETURN $b/title
Title and text of documents containing at least three occurrences of “stocks”
FOR $a IN view("text_table") WHERE
numMatches($a/text_document,"stocks") >= 3
RETURN
<text>{$a/text_title}{$a/text_document}</>

Tutorial outline

IR support   Relational data model        XML-like data model
None         SQL, Datalog                 XML-QL, XQuery
Schema       WHIRL                        ELIXIR, XIRQL
No schema    DBXplorer, BANKS, DISCOVER   EasyAsk, Mercado, DataSpot, BANKS

Review of text indexing and information retrieval
Support for text search and similarity join in relational databases with text columns (WHIRL)
Adding IR-like text search features to XML query languages (Chinenyanga et al., Fuhr et al. 2001)

ELIXIR: Adding IR to XQuery

Ranked select
for $t in document("db.xml")/items/(book|cd)
where $t/text() ~ "Ukrainian recipe"
return <dish>$t</dish>
Ranked similarity join: find titles in recent VLDB proceedings similar to speeches in Macbeth
for $vi in
document("vldb.xml")/issue[@volume>24],
$si in document("macbeth.xml")//speech
where $vi//article/title ~ $si
return <similar><title>$vi//article/title</>
<speech>$si</></similar>

How ELIXIR works
[Figure: ELIXIR processing pipeline]
Inputs: ELIXIR query; base XML documents (VLDB.xml, Macbeth.xml)
ELIXIR compiler
XQuery filters/transformers
Flatten to WHIRL
WHIRL select/join filters
Rewrite to XML
Result

A more detailed view
[Figure: detailed view of the ELIXIR pipeline for the example query, from the XML documents through the WHIRL query to the result]

Observations
SQL/XQuery + IR-like result ranking
Schema knowledge remains essential
  “Free-form” text vs. tagged, typed field
  Element hierarchy, element names, IDREFs
Typical Web search is two words long
End-users don’t type SQL or XQuery
Possible remedy: HTML form access
Limitation: restricted views and queries

Using proximity without schema

General, detailed representation: XML
Lowest common representation
Collection, document, terms
Document = node, hyperlink = edge
Middle ground
Graph with text (or structured data) in nodes
Links: element, subpart, IDREF, foreign keys
All links hint at unspecified notion of proximity
Exploit structure where available, but do not
impose structure by fiat

Two paradigms of proximity search
A single node as query response
Find node that matches query terms…
…or is “near” nodes matching query terms
(Goldman et al., 1998)
A connected subgraph as query response
Single node may not match all keywords
No natural “page boundary”


Single-node response examples

Query keywords → response
Travolta, Cage → Actor, Face/Off
Travolta, Cage, Movie → Face/Off
Kleiser, Movie → Gathering, Grease
Kleiser, Woo, Actor → Travolta
[Figure: data graph with Movie nodes (Gathering, Grease, Face/Off), Actor nodes (Travolta, Cage) and Director nodes (Kleiser, Woo), linked by “is-a”, “acted-in” and “directed” edges]

Basic search strategy
Node subset A activated because they
match query keyword(s)
Look for node near nodes that are
activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)


Ranking a single node response

Activated node set A
Rank node r in “response set” R based
on proximity to nodes a in A
Nodes have relevance ρ_A and ρ_R in [0,1]
Edge costs are “specified by the system”
d(a,r) = cost of shortest path from a to r
Bond between a and r: $b(a,r) = \frac{\rho_A(a)\,\rho_R(r)}{d(a,r)^t}$
Parameter t tunes relative emphasis on distance and relevance score
Several ad-hoc choices

Scoring single response nodes
Additive: $\mathrm{score}(r) = \sum_{a \in A} b(a,r)$
Belief: $\mathrm{score}(r) = 1 - \prod_{a \in A} (1 - b(a,r))$
Goal: list a limited number of find nodes with the largest scores
Performance issues
  Assume the graph is in memory?
  Precompute all-pairs shortest path (|V | )?
  Prune unpromising candidates?
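A minimal in-memory sketch of single-node ranking follows. It assumes the bond b(a,r) = ρ_A(a) ρ_R(r) / d(a,r)^t with distances from Dijkstra's algorithm, and it supports both the additive and the belief score. The function names are hypothetical, an activated node does not bond with itself (distance 0), and the belief variant assumes edge costs of at least 1 so that every bond stays in [0,1]. These are choices made for the example, not necessarily those of the cited system.

```python
import heapq
import math


def shortest_dists(adj, source):
    """Dijkstra: cost of the cheapest path from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def score_nodes(adj, activated, responses, t=2.0, belief=False):
    """activated: {a: rho_A(a)}, responses: {r: rho_R(r)}; returns ranked (r, score)."""
    dists = {a: shortest_dists(adj, a) for a in activated}
    scores = {}
    for r, rho_r in responses.items():
        bonds = [rho_a * rho_r / dists[a][r] ** t
                 for a, rho_a in activated.items()
                 if dists[a].get(r, 0.0) > 0.0]  # skip unreachable nodes and a == r
        if belief:
            scores[r] = 1 - math.prod(1 - b for b in bonds)
        else:
            scores[r] = sum(bonds)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Running one Dijkstra per activated node avoids precomputing all-pairs shortest paths, at the cost of repeating work across queries; that trade-off is what the performance questions on the slide are about.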

Hub indexing
Decompose APSP problem using sparse vertex cuts
  |A|+|B| shortest paths to p
  |A|+|B| shortest paths to q
  d(p,q)
To find d(a,b) compare
  d(a→p→b) not through q
  d(a→q→b) not through p
  d(a→p→q→b)
  d(a→q→p→b)
Greatest savings when |A|≈|B|
Heuristics to find cuts, e.g. large-degree nodes
[Figure: graph split into parts A and B by the vertex cut {p, q}, with a in A and b in B]
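The four-way comparison can be sketched as a single lookup function. The sketch assumes an undirected graph in which {p, q} separates A from B, and two precomputed tables: `to_p[x]` is the cost of the cheapest path from x to p that avoids q, and `to_q[x]` is the cheapest path from x to q that avoids p. These table names and the function name are hypothetical.

```python
def hub_distance(a, b, to_p, to_q, d_pq):
    """d(a, b) for a in A and b in B, when every a-b path crosses the cut {p, q}."""
    return min(
        to_p[a] + to_p[b],          # a -> p -> b, not through q
        to_q[a] + to_q[b],          # a -> q -> b, not through p
        to_p[a] + d_pq + to_q[b],   # a -> p -> q -> b
        to_q[a] + d_pq + to_p[b],   # a -> q -> p -> b
    )


# Toy tables: storage is 2(|A| + |B|) + 1 numbers instead of |A| * |B| distances.
to_p = {"a": 1, "b": 5}
to_q = {"a": 4, "b": 1}
print(hub_distance("a", "b", to_p, to_q, d_pq=1))
```

The saving over a full distance matrix is largest when the two sides are balanced, since |A|·|B| is then largest relative to |A|+|B|.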

Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
Two scenarios
Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data
• Keywords spread among DOM nodes and subtrees

Tutorial outline

IR support   Relational data model        XML-like data model
None         SQL, Datalog                 XML-QL, XQuery
Schema       WHIRL                        ELIXIR, XIRQL
No schema    DBXplorer, BANKS, DISCOVER   EasyAsk, Mercado, DataSpot, BANKS

Adding IR-like text search features to XML query languages
A graph model for relational data with “free-form” text search and implicit joins
Generalizing to graph models for XML

Keyword search on relational data
Tuple = node
Some columns have text
Foreign key constraints = edges in schema graph
Query = set of terms
No natural notion of a document
  Normalization
  Join may be needed to generate results
Cycles may exist in schema graph: ‘Cites’
[Figure: schema graph over the tables Cites, Paper, Writes and Author]

Writes
AuthorID  PaperID
A1        P1
A2        P2
A3        P2

Author
AuthorID  AuthorName
A1        Chaudhuri
A2        Sudarshan
A3        Hulgeri

Cites
Citing  Cited
P2      P1

Paper
PaperID  PaperName
P1       DBXplorer
P2       BANKS

DBXplorer and DISCOVER

Enumerate subsets of relations in schema graph
which, when joined, may contain rows which
have all keywords in the query
“Join trees” derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches
(Agrawal et al. 2001, Hristidis et al. 2002)
[Figure: schema graph over tables T1 to T5 in which the keywords K1, K2, K3 match different tables, together with the candidate join trees enumerated from it]

Discussion
Exploits relational schema information to contain search
Pushes final extraction of joined tuples into RDBMS
Faster than dealing with full data graph directly

Coarse-grained ranking based on schema tree
Does not model proximity or (dis)similarity of individual tuples
No recipe for data with less regular (e.g. XML) or ill-defined schema

Generalized graph proximity

General data graph
Nodes have text, can be scored against query
Edge weights express dissimilarity
Query is a set of keywords as before
Response is a connected subgraph of the
database
Each response graph is scored using
Node weights which reflect match, maximize
Edge weights which reflect lack of proximity,
minimize

Motivation from Web search
“Linux modem driver for a Thinkpad A22p”
  Hyperlink path matches query collectively
  Conjunction query would fail
Projects where X and P work together
  Conjunction may retrieve wrong page
General notion of graph proximity
[Figure: hyperlinked pages that match the query keywords only collectively]

“Information unit” (Lee et al., 2001)

Generalizes join trees to arbitrary graph data
Connected subgraph of data without cycles
Includes at least one node containing each
query keyword
Edge weights represent price to pay to connect
all keyword-matching nodes together
May have to include non-matching nodes

Setting edge weights
Edges are generally directed
  Foreign to primary key in relational data
  Containing to contained element in XML
  IDREFs have clear source and target
Consider the RDBMS scenario
Forward edge weight for edge (u,v)
  u, v are tuples in tables R(u), R(v)
  Weight s(R(u),R(v)) between tables
Proximity search must traverse edges in both directions … what should w_b(u,v) be?
[Figure: a Cites tuple with Citing (Src) = Paper1 and Cited (Dst) = Paper2, drawn as an edge between the Paper1 and Paper2 nodes]

Backward edge weights

“Distance” between a pair of nodes is asymmetric in general
  Ted Raymond acted only in The Truman Show, which is 1 of 55 movies for Jim Carrey
  w(e_1) should be larger than w(e_2) (think “resistance” on the edge)
For every edge (u,v) that exists, $w_b(u,v) = s(R(v),R(u)) \cdot \mathrm{IN}_v(u)$
  $\mathrm{IN}_v(u)$ is the #edges from R(v) to u
  $w(u,v) = \min\{w_f(u,v), w_b(u,v)\}$
More general edge weight models possible, e.g., RST relation path-based weights
[Figure: Ted Raymond and Jim Carrey each linked to The Truman Show; Jim Carrey is linked to many other movies]
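One way to read the forward and backward weights is sketched below. It assumes that each foreign-key reference u→v yields a forward edge (u,v) with weight s(R(u),R(v)) and a backward edge (v,u) whose weight is that similarity multiplied by the number of references into v from tuples of R(u); when both directions arise, the smaller weight wins. The function name, the `table_of` mapping and the default similarity of 1.0 are assumptions made for the example.

```python
from collections import Counter


def edge_weights(forward_edges, table_of, s=lambda r1, r2: 1.0):
    """Return {(x, y): weight} for all forward and derived backward edges."""
    forward = set(forward_edges)
    # in_count[(T, v)] = number of references into tuple v from tuples of table T
    in_count = Counter((table_of[u], v) for u, v in forward)
    weights = {}

    def offer(edge, w):
        # keep the minimum when an edge gets both a forward and a backward weight
        weights[edge] = min(w, weights.get(edge, float("inf")))

    for u, v in forward:
        sim = s(table_of[u], table_of[v])
        offer((u, v), sim)                                # forward weight
        offer((v, u), sim * in_count[(table_of[u], v)])   # backward weight
    return weights
```

With this reading, leaving a heavily referenced tuple such as a prolific actor is expensive in the backward direction, so two movies that merely share that actor end up far apart, while a tuple referenced only once stays close to its neighbor.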

Node weight = relevance + prestige
Relevance w.r.t. keyword(s)
  0/1: node contains term or it does not
  Cosine score in [0,1] as in IR
  Uniform model: a node for each keyword (e.g. DataSpot)
Popularity or prestige
  E.g. “mohan transaction”
  Indegree
  PageRank
PageRank: w.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.
$p(v) = \frac{d}{N} + (1-d) \sum_{u \to v} \frac{p(u)}{\mathrm{OutDegree}(u)}$
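The PageRank recurrence above can be iterated to a fixed point as in the following sketch. The function name, the default d = 0.15, the fixed iteration count, and the handling of nodes without out-links (their mass is spread uniformly) are assumptions for the example; the slide does not specify them.

```python
def pagerank(out_links, d=0.15, iters=100):
    """Iterate p(v) = d/N + (1-d) * sum over u->v of p(u)/OutDegree(u)."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: d / n for v in nodes}          # random jump with probability d
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:
                    nxt[v] += share              # follow an out-link u.a.r.
            else:
                for v in nodes:                  # dangling node (assumption)
                    nxt[v] += (1 - d) * p[u] / n
        p = nxt
    return p


p = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(p, key=p.get, reverse=True))  # nodes by decreasing prestige
```

In a keyword-search setting the resulting p(v) would serve as the prestige part of the node weight, alongside the relevance score.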

Trading off node and edge weights

A high-scoring answer A should have
Large node weight
Small edge weight
Weights must be normalized to extreme values
N(v) = node weight of v
Overall NodeScore = $\frac{\sum_{v \in A} \log(1 + N(v)/N_{\max})}{\#\,\mathrm{nodes}}$
Overall EdgeScore = $\frac{1}{1 + \sum_{e \in A} \log(1 + w(e)/w_{\min})}$
Overall score = EdgeScore × NodeScore^λ
λ tunes relative contribution of nodes and edges
Ad-hoc, but guided by heuristic choices in IR
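The combined score can be computed as in the sketch below, which takes the node and edge weights of one answer tree. It assumes "# nodes" means the number of nodes in the answer and that `n_max` and `w_min` are the largest node weight and smallest edge weight in the whole graph; the function names and the default λ are hypothetical.

```python
import math


def node_score(node_weights, n_max):
    """Average of log(1 + N(v)/N_max) over the nodes of the answer."""
    return sum(math.log(1 + w / n_max) for w in node_weights) / len(node_weights)


def edge_score(edge_weights, w_min):
    """1 / (1 + sum of log(1 + w(e)/w_min)) over the edges of the answer."""
    return 1 / (1 + sum(math.log(1 + w / w_min) for w in edge_weights))


def overall_score(node_weights, edge_weights, n_max, w_min, lam=0.2):
    """EdgeScore * NodeScore ** lambda."""
    return edge_score(edge_weights, w_min) * node_score(node_weights, n_max) ** lam
```

An answer with no edges has EdgeScore 1, and every additional or heavier edge lowers the score, so small, tightly connected answers are preferred. Raising λ shifts the emphasis toward node weight.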

Data structures for search
Answer = tree with at least one leaf containing each keyword in query
  Group Steiner tree problem, NP-hard
Query term t found in source nodes S_t
Single-source-shortest-path SSSP iterator
  Initialize with a source (near-) node
  Consider edges backwards
  getNext() returns next nearest node
For each iterator, each visited node v maintains for each t a set v.R_t of nodes in S_t which have reached v

Generic expanding search

Near node sets S_t with S = ∪_t S_t
For all source nodes σ ∈ S
  create a SSSP iterator with source σ
While more results required
  Get next iterator and its next-nearest node v
  Let t be the term for the iterator’s source s
  crossProduct = {s} × Π_{t' ≠ t} v.R_{t'}
  For each tuple of nodes in crossProduct
  • Create an answer tree rooted at v with paths to each source node in the tuple
  Add s to v.R_t
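The loop above can be sketched in Python with one lazy Dijkstra generator per source node and a heap that always advances the iterator whose next node is nearest. This is a simplified illustration, not the implementation of any cited system: the names are hypothetical, `rev_adj[x]` is assumed to list `(y, weight)` for every edge y→x so that expansion runs backwards, and an answer records only its root and, per keyword, the chosen source node and its distance rather than the full paths.

```python
import heapq
from itertools import product


def sssp_iter(rev_adj, source):
    """Yield (dist, node) in nondecreasing distance, following edges backwards."""
    dist, done, heap = {source: 0.0}, set(), [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        yield d, u
        for v, w in rev_adj.get(u, ()):
            if v not in done and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))


def expanding_search(rev_adj, keyword_nodes, max_answers=10):
    """keyword_nodes: {term t: set S_t}. Returns [(root, {term: (source, dist)})]."""
    terms = list(keyword_nodes)
    reached = {}   # v -> {t: {source: dist}}, i.e. the sets v.R_t
    frontier = []  # (next dist, tie-breaker, node, term, source, iterator)
    tick = 0
    for t in terms:
        for s in sorted(keyword_nodes[t]):
            it = sssp_iter(rev_adj, s)
            d, v = next(it)
            heapq.heappush(frontier, (d, tick, v, t, s, it))
            tick += 1
    answers = []
    while frontier and len(answers) < max_answers:
        d, _, v, t, s, it = heapq.heappop(frontier)
        r_sets = reached.setdefault(v, {u: {} for u in terms})
        others = [u for u in terms if u != t]
        # crossProduct = {s} x product of v.R_t' over the other terms t'
        for combo in product(*(r_sets[u].items() for u in others)):
            leaves = {t: (s, d)}
            leaves.update(zip(others, combo))
            answers.append((v, leaves))
        r_sets[t][s] = d                      # add s to v.R_t
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], tick, nxt[1], t, s, it))
            tick += 1
    return answers[:max_answers]
```

Answers come out roughly in order of increasing path length, not in final score order, so a real system would still rank a buffer of generated trees with the node and edge weights before showing them.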

Search example (“Vu Kleinberg”)
[Figure: data graph of author and paper nodes linked by “writes” and “cites” edges. Authors: Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos. Papers: “Organizing Web pages by ‘Information Unit’”, “Authoritative sources in a hyperlinked environment”, “A metric labeling problem”]

First response
[Figure: the same author/paper graph, with the first response to the query highlighted]

Folding in user feedback
As in IR systems, results may be imperfect
Unlike SQL or XQuery, no exact control over
matching, ranking and answer graph form
Ad-hoc choices for node and edge weights
Per-user and/or per-session
By graph/path/node type, e.g. “want author
citing author,” not “author coauthoring with
author”
Across users
Modifying edge costs to favor nodes (or node
types) liked by users

Random walk formulations

Generalize PageRank to treat outlinks differently
τ(u,v) is the “conductance” of edge (u,v)
p(v) is a function of τ(u,v) for all in-neighbors u of v
  W.p. d jump to a random node
  W.p. 1-d = τ_1+τ_2+τ_3 jump to an out-neighbor
  $p(v) = \frac{d}{N} + \sum_{u \to v} p(u)\,\tau(u,v)$
p_guess(v) … at convergence
p_user(v) … user feedback
Gradient ascent/descent: $\frac{\partial p(v)}{\partial \tau(u,v)} = p(u)$
For each (u,v), set (with learning rate η):
$\tau(u,v) \leftarrow \tau(u,v) + \eta\,\mathrm{sgn}(p_{\mathrm{user}}(v) - p_{\mathrm{guess}}(v))\,\frac{p(u)}{\sum_{u' \to v} p(u')}$
Re-iterate to convergence

Prototypes and products
DTL DataSpot Mercado Intuifind www.mercado.com/
EasyAsk www.easyask.com/
ELIXIR www.smi.ucd.ie/elixir/
XIRQL ls6-www.informatik.uni-dortmund.de/ir/projects/hyrex/
Microsoft DBXplorer
BANKS www.cse.iitb.ac.in/banks/

Summary
Confluence of structured and free-format,
keyword-based search
Extend SQL, XQuery, Web search, IR
Many useful applications: product catalogs,
software libraries, Web search
Key idiom: proximity in a graph
representation of textual data
Implicit joins on foreign keys
Proximity via IDREF and other links
Several working systems
Not enough consensus on clean models
