Text Search for Fine-grained
Semi-structured Data
Soumen Chakrabarti
Indian Institute of Technology, Bombay
www.cse.iitb.ac.in/~soumen/
Acknowledgments
S. Sudarshan Arvind Hulgeri
B. Aditya Parag
Two extreme search paradigms
Searching a RDBMS Information Retrieval
Complex data model: Collection = set of
tables, rows, documents, document
columns, data types = sequence of terms
Expressive, powerful Terms and phrases
query language present or absent
Need to know No (nontrivial)
schema to query schema to learn
Answer = unordered Answer = sequence
set of rows of documents
Ranking: afterthought Ranking: central to IR
1
Convergence?
SQLXML search Web searchIR
Trees, reference links Documents are nodes
Labeled edges in a graph
Nodes may contain Hyperlink edges have
Structured data important but
Free text fields unspecified semantics
Data vs. document Google, HITS
Query involves node Query language
data and edge labels remains primitive
Partial knowledge of No data types
schema ok No use of tag-tree
Answer = set of paths Answer = URL list
Outline of this tutorial
Review of text indexing and
information retrieval (IR)
Support for text search and similarity join in
relational databases with text columns
Text search features in major XML query
languages (and what’s missing)
A graph model for semi-structured data with
“free-form” text in nodes
Proximity search formulations and techniques;
how to rank responses
Folding in user feedback
Trends and research problems
!""! #$%&'%(%')* +
2
Text indexing basics
“Inverted index” maps from
term to document IDs ;<= >?@AB CD EFDD FG >?@A
Term offset info enables
D1
HCIJ FEK >?@A KFLA
phrase and proximity
(“near”) searches
MFN@ >?@A CD O?CL FG
>?@A HCIJ LAH >?@A HFL
Document boundary and
limitations of “near” queries
>?@A D2
D1: 1, 5, 8
Can extend inverted index
D2: 1, 5, 8
to map terms to
LAH
D2: 7
FEK
Table names, column names
D1: 7
EFDD
Primary keys, RIDs
D1: 3
XML DOM node IDs
,-./ 0110 23456474689 :
Information retrieval basics
Stopwords and stemming `abc
Each term t in lexicon gets a
hijcbk lc`mebn
dimension in vector space
Documents and the query
deff
eg
are vectors in term space Scale up Scale
down
Component of d along axis t is TF(d,t)
Absolute term count or scaled by max term count
Downplay frequent terms: IDF(t) = log(1+|D|/|D _|)
Better model: document vector d has component
TF(d,t) IDF(t) for term t
Query is like another “document”; documents
ranked by cosine similarity with query
PQRS TUUT VWXYZX[XZ\] ^
3
Map
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
“None” = nothing more than string equality, containment
(substring), and perhaps lexicographic ordering
“Schema”: Extensions to query languages, user needs to
know data schema, IR-like ranking schemes, no implicit
joins
“No schema”: Keyword queries, implicit joins
opqr stts uvwxywzwy{| }
WHIRL (Cohen 1998)
place(univ,state) and job(univ,dept)
Ranked retrieval from a RDBMS:
select univ from job where dept ~ ‘Civil’
Ranked similarity join on text columns:
select state, dept from place, job
where place.univ ~ job.univ
Limit answer to best k matches only
Avoid evaluating full Cartesian product
“Iceberg” query
Useful for data cleaning and integration
opqr stts uvwxywzwy{| ~
4
WHIRL scoring function
A where-clause in WHIRL is a
Boolean predicate as in SQL (age=35)
Score for such clauses are 0/1
Similarity predicate (job ~ ‘Web design’)
Score = cosine(job, ‘Web design’)
Conjunction or disjunction of clauses
Sub-clause scores interpreted as probabilities
score(B ∧ … ∧B ; θ)=Π ≤ ≤ score(B ,θ)
score(B ∨ … ∨B ; θ)=1 — Π ≤ ≤ (1—score(B ,θ))
Query execution strategy
select state, dept from place, job
where place.univ ~ job.univ
Start with place(U1,S) and job(U2,D)
where U1, U2, S and D are “free”
Any binding of these variables to constants is
associated with a score
Greedily extend the current bindings for
maximum gain in score
Backtrack to find more solutions
5
XQuery
Quilt + Lorel + YATL + XML-QL
Path expressions recipes.xml
<dishes_with_flour> { FOR $r IN
document("recipes.xml")
//recipe[//ingredient[@name="flour"]]
RETURN <dish>{$r/title/text()}</dish> }
</dishes_with_flour>
recipe $r
title
name
Tortilla ingredient “flour”
¡¢£ ¤¥¥¤ ¦§¨©ª¨«¨ª¬ ®®
Early text support in XQuery
Title of books containing some para mentioning
both “sailing” and “windsurfing”
FOR $b IN document("bib.xml")//book
WHERE SOME $p IN $b//paragraph SATISFIES
(contains($p,"sailing") AND
contains($p,"windsurfing"))
RETURN $b/title
Title and text of documents containing at least
three occurrences of “stocks”
FOR $a IN view("text_table") WHERE
numMatches($a/text_document,"stocks") > 3
RETURN
<text>{$a/text_title}{$a/text_document}</>
¯°±² ³´´³ µ¶·¸¹·º·¹»¼ ½³
6
Tutorial outline
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
Review of text indexing and information retrieval
Support for text search and similarity join in
relational databases with text columns (WHIRL)
Adding IR-like text search features to XML query
languages (Chinenyanga et al. Führ et al. 2001)
¾¿ÀÁ ÂÃàÄÅÆÇÈÆÉÆÈÊË ÌÍ
ELIXIR: Adding IR to XQuery
Ranked select
for $t in document(“db.xml”)/items/(book|cd)
where $t/text() ~ “Ukrainian recipe”
return <dish>$t</dish>
Ranked similarity join: find titles in recent
VLDB proceedings similar to speeches in
Macbeth
for $vi in
document(“vldb.xml”)/issue[@volume>24],
$si in document(“macbeth.xml”)//speech
where $vi//article/title ~ $si
return <similar><title>$vi//article/title</>
<speech>$si</></similar>
¾¿ÀÁ ÂÃàÄÅÆÇÈÆÉÆÈÊË ÌÎ
7
How ELIXIR works
ELIXIR VLDB.xml Macbeth.xml Base XML
query documents
ELIXIR XQuery filters/
Compiler transformers
Flatten to WHIRL
WHIRL select/join filters
Rewrite to XML
Result
ÏÐÑÒ ÓÔÔÓ ÕÖרÙ×Ú×ÙÛÜ ÝÞ
ÿ
A more detailed view
àáââãäåàæçèãéäåêëàìå àíðï üãé äî ñ
å
àíîï áðèäåñàìå ñàìå àâðäüä üãé äî ñ
å
àáââãäåàæçèãéäåòóàìå àâýääðåç
îäèíü
çãî
àíîï áðèäåàï áï èäåôáõä ö÷øùúùûä ç îï ãüä àìå
ö÷øùúùû áçü âýíï áíè þçáüàìåñàìåàìå àìåàìå
!" #$% &' !" #$9 &'
(!)*+,'%-./01234+56788&99*, (!)*+,'%-.DEFGHIJKLMNOPQQEFIQRFHSHQRTHHFJ
:;!5*+, <=88%&%5,
UHIVUS WIVTNHXWNYSHXZ [ER \WQXWQIVTNHX \ W Q]^^X
",%*"'
%*>5,%&%5, #$% ?88%*>5, ? 8 {||}~
BC
A _`aab_cdefgb_ fhigbjk lmgfnio p
à@ òêåàï ãýèäåàï áïèäåôáõä ö÷øùúùû áçü l q k d m rstuvuwgo xkmcdig y
âýíï áíè þçáüàì ï áïèäåàìï ãýèäåàì@ òêå _zfhigb_zcdefgb_z`aab
][IYINH[NYSHP
]^[I YINHP ]^^[NYSHP [I YINH [ NYSH
WHIRL query
Result
ÏÐÑÒ ÓÔÔÓ ÕÖרÙ×Ú×ÙÛÜ Ýß
8
Observations
SQL/XQuery + IR-like result ranking
Schema knowledge remains essential
“Free-form” text vs. tagged, typed field
Element hierarchy, element names,
IDREFs
Typical Web search is two words long
End-users don’t type SQL or XQuery
Possible remedy: HTML form access
Limitation: restricted views and queries
¡ ¢££¢ ¤¥¦§¨¦©¦¨ª« ¬
Using proximity without schema
General, detailed representation: XML
Lowest common representation
Collection, document, terms
Document = node, hyperlink = edge
Middle ground
Graph with text (or structured data) in nodes
Links: element, subpart, IDREF, foreign keys
All links hint at unspecified notion of proximity
Exploit structure where available, but do not
impose structure by fiat
®¯°± ²³³² ´µ¶·¸¶¹¶¸º» ¼½
9
Two paradigms of proximity search
A single node as query response
Find node that matches query terms…
…or is “near” nodes matching query terms
(Goldman et al., 1998)
A connected subgraph as query response
Single node may not match all keywords
No natural “page boundary”
¾¿ÀÁ ÂÃàÄÅÆÇÈÆÉÆÈÊË ÌÍ
Single-node response examples
Travolta, Cage Movie
Actor, Face/Off “is-a”
Travolta, Cage, Gathering Grease Face/Off
Movie
Face/Off “acted-in”
Kleiser, Movie A3 Travolta Cage
“directed”
Gathering, Grease “is-a”
Kleiser, Woo, Actor
Actor
Travolta Kleiser Woo
“is-a”
Director
ÎÏÐÑ ÒÓÓÒ ÔÕÖרÖÙÖØÚÛ ÒÓ
10
Basic search strategy
Node subset A activated because they
match query keyword(s)
Look for node near nodes that are
activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)
ÜÝÞß àááà âãäåæäçäæèé àê
Ranking a single node response
Activated node set A
Rank node r in “response set” R based
on proximity to nodes a in A
Nodes have relevance ρ ù and ρú in [0,1]
Edge costs are “specified by the system”
d(a,r) = cost of shortest path from a to r
Bond between a and r ρ (a ) ρ R ( r )
b(a, r ) = A
d (a, r )t
Parameter t tunes relative emphasis on
distance and relevance score
Several ad-hoc choices
ëìíî ïððï ñòóôõóöóõ÷ø ïï
11
Scoring single response nodes
Additive
score(r ) = ∑a∈A b(a, r )
Belief
score(r ) = 1 − ∏a∈A (1 − b(a, r ))
Goal: list a limited number of find nodes
with the largest scores
Performance issues
Assume the graph is in memory?
Precompute all-pairs shortest path (|V | )?
Prune unpromising candidates?
ûüýþ ÿ ÿ ÿ
Hub indexing
Decompose APSP problem using sparse
vertex cuts
|A|+|B | shortest paths to p A B
|A|+|B | shortest paths to q
d(p,q) p
a b
To find d(a,b) compare
d(apb) not through q q
d(aqb) not through p
d(apqb)
d(aqpb)
Greatest savings when |A|≈|B|
Heuristics to find cuts, e.g. large-degree nodes
12
Connected subgraph as response
Single node may not match all keywords
No natural “page boundary”
Two scenarios
Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data
• Keywords spread among DOM nodes and
subtrees
!"#$"%"$&' (
Tutorial outline
Data model
Relational XML-like
None SQL,Datalog XML-QL, Xquery
Schema WHIRL ELIXIR, XIRQL
IR
DBXplorer,
support No EasyAsk, Mercado,
BANKS,
schema DataSpot, BANKS
DISCOVER
Adding IR-like text search features to XML query
languages
A graph model for relational data with “free-form”
text search and implicit joins
Generalizing to graph models for XML
)*+, -..- /0123141356 -7
13
Keyword search on relational data
Tuple = node
GHIJK TUVJW
LMN MOP XYZQ[\]
Some columns have text LMNQR XYZQ[^Y_Q
Foreign key constraints =
SSS SSS
efghij ` WHIJK
edges in schema graph klmnopqr
klmnopstuv
abNcd [\]
XYZQ[\]
Query = set of terms www SSS
No natural notion AuthorID PaperID AuthorID AuthorName
of a document A1 P1 A1 Chaudhuri
Normalization
A2 P2 A2 Sudarshan
A3 P2 A3 Hulgeri
Join may be needed Citing Cited PaperID PaperName
to generate results P2 P1 P1 DBXplorer
Cycles may exist in P2 BANKS
schema graph: ‘Cites’
89:; <==< >?@AB@C@BDE <F
DBXplorer and DISCOVER
Enumerate subsets of relations in schema graph
which, when joined, may contain rows which
have all keywords in the query
“Join trees” derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches
(Agrawal et al. 2001, Hristidis et al. 2002)
T4
K1,K2,K3
T4 T2 T2 T3
T1 T2 T3 K2 T5
T4 T2 T3
T5
T2 ~
T3
T5
xyz{ |}}| K3 |
14
Discussion
Exploits relational Coarse-grained
schema information to ranking based on
contain search schema tree
Pushes final Does not model
extraction of joined proximity or (dis)
tuples into RDBMS similarity of individual
Faster than dealing tuples
with full data graph No recipe for data
directly with less regular (e.g.
XML) or ill-defined
schema
Generalized graph proximity
General data graph
Nodes have text, can be scored against query
Edge weights express dissimilarity
Query is a set of keywords as before
Response is a connected subgraph of the
database
Each response graph is scored using
Node weights which reflect match, maximize
Edge weights which reflect lack of proximity,
minimize
15
Motivation from Web search
“Linux modem driver §¨© ª«¬®¯°±²
for a Thinkpad A22p”
³´µ¶·
³´µµ¸ ¹º»¼½¾¿À
Hyperlink path
Á»ÃÄÂÅ
³ÆÇÈÉÊËÌ Í Î
matches query ßÊË ÈüÊýÉ
³ÏÇÈÐÑ
collectively
éÈÌì ýüüýìÇÊÈ ìǸÌ
³þÊÉî·
Conjunction query ³ÿìíîêÈîì
would fail ò çÕ ó ôõÛöÕÔ
Projects where X and
÷Âøä¾ ùÄùúÄÂÅ
³Î
P work together
ÒÓÔÕ Ö×ØÕ ÓÙ
³û
ÖÚÓÙÕÛÛÓÚ Ü
³Í
Conjunction may
Ý¿¾ÄÂÅ
³Þ Ïßàá
retrieve wrong page âãäÀļãÅ
General notion of
³Î ÖæÛ çÓÔÕ èרÕ
³å é Ë Êêë ÊÈ ìíî à
graph proximity ¸êÊï îðìñ
¡¢¡£¤ ¥¦
“Information unit” (Lee et al., 2001)
Generalizes join trees to arbitrary graph data
Connected subgraph of data without cycles
Includes at least one node containing each
query keyword
Edge weights represent price to pay to connect
all keyword-matching nodes together
May have to include non-matching nodes
16
Setting edge weights
Edges are generally directed
Paper1
Foreign to primary key in relational data
Containing to contained element in XML
Paper2
IDREFs have clear source and target
Consider the RDMS scenario Cites
Forward edge weight for edge (u,v) Citing (Src) Cited (Dst)
Paper1 Paper2
u, v are tuples in tables R(u), R(v)
Weight s(R(u),R(v)) between tables Paper1
' ()*+,-./01 20./,34 ,56778 96301 )* 30:6*4 ,53
' ;<=>? @ABC=D=>A ?D= @AA 677 3.5 2 4 .E70 E6 ,/3 >? @
Paper2
Proximity search must traverse edges in
both directions … what should w F(u,v) be?
!" # "$% &&
Backward edge weights
“Distance” between a pair of nodes is
asymmetric in general jmm
Ted Raymond acted only in
…
The Truman Show, which is
^_``ab
jl
1 of 55 movies for Jim Carrey en jk
w(e W) should be larger than w(eX)
(think “resistance” on the edge) hhi
For every edge (u,v) that exists, c_bdefg eo
w Y(u,v)=s(R(v),R(u)) . IN Z(u)
IN [(u) is the #edges from R(v) to u
w(u,v) = min{w \(u,v), w ](u,v)}
More general edge weight models
possible, e.g., RST relation path-
based weights
GHIJ KLLK MNOPQOROQST UV
17
Node weight = relevance + prestige
Relevance w.r.t. keyword(s)
0/1: node contains term or it does not
Cosine score in [0,1] as in IR
Uniform model: a
node for each keyword
(e.g. DataSpot)
Popularity or prestige
W.p. d jump to
a random node
E.g. “mohan transaction” W.p. (1-d)
Indegree
jump to an
out-neighbor
PageRank
u.a.r.
p(v ) = + (1 − d ) ∑
d p(u )
pqrs tuut vw
N|} u →v OutDegree( u )
~
xyzx{xz
Trading off node and edge weights
A high-scoring answer A should have
Large node weight
Small edge weight
Weights must be normalized to extreme values
N(v)=node weight of v
Overall NodeScore =
∑ v ∈A
(
log 1 + N (v ) Nmax )
# nodes
Overall EdgeScore =
1
1 + ∑e∈A log(1 + w ( e )w min )
Overall score = EdgeScore × NodeScoreλ
λ tunes relative contribution of nodes and edges
Ad-hoc, but guided by heuristic choices in IR
18
Data structures for search
Answer = tree with at least one leaf
containing each keyword in query
Group Steiner tree problem, NP-hard
Query term t found in source nodes S
Single-source-shortest-path SSSP iterator
Initialize with a source (near-) node
Consider edges backwards
getNext() returns next nearest node
For each iterator, each visited node v
maintains for each t a set v.R ® of nodes in
S ® which have reached v
¡¢¢¡ £¤¥¦§¥¨¥§©ª «¬
Generic expanding search
Near node sets S ¿ with S = ∪ ¿ S ¿
For all source nodes σ ∈ S
create a SSSP iterator with source σ
While more results required
Get next iterator and its next-nearest node v
Let t be the term for the iterator’s source s
crossProduct = {s} × Π À Á≠ Âv.R  Ã
For each tuple of nodes in crossProduct
• Create an answer tree rooted at v with paths to
each source node in the tuple
Add s to v.R Â
¯°±² ³´´³ µ¶·¸¹·º·¹»¼ ½¾
19
Search example (“Vu Kleinberg”)
Quoc Vu Jon Kleinberg
writes
writes
writes
Organizing Web pages
by “Information Unit” Authoritative sources in a
hyperlinked environment
cites
writes A metric
cites labeling problem
cites
Divyakant Agrawal writes
author paper writes cites Eva Tardos
ÄÅÆÇ ÈÉÉÈ ÊËÌÍÎÌÏÌÎÐÑ ÒÓ
First response
Quoc Vu Jon Kleinberg
writes
writes
writes
Organizing Web pages
by “Information Unit” Authoritative sources in a
hyperlinked environment
cites
writes A metric
cites labeling problem
cites
Divyakant Agrawal writes
author paper writes cites Eva Tardos
ÄÅÆÇ ÈÉÉÈ ÊËÌÍÎÌÏÌÎÐÑ ÔÉ
20
Folding in user feedback
As in IR systems, results may be imperfect
Unlike SQL or XQuery, no exact control over
matching, ranking and answer graph form
Ad-hoc choices for node and edge weights
Per-user and/or per-session
By graph/path/node type, e.g. “want author
citing author,” not “author coauthoring with
author”
Across users
Modifying edge costs to favor nodes (or node
types) liked by users
ÕÖר ÙÚÚÙ ÛÜÝÞßÝàÝßáâ ãä
Random walk formulations
Generalize PageRank to W.p. d jump to
treat outlinks differently
a random node
τù
τ(u,v) is the “conductance”
W.p. 1-d =
τ1+τ2+τ3
of edge uv τú
jump to an
τû
p(v) is a function of τ(u,v) out-neighbor
for all in-neighbors u of v p(v ) = d +
pôõö÷÷(v) … at convergence N u →v
∑ p(u ) τ (u,v )
p õ÷öø(v) … user feedback ∂p(v )
= p(u )
Gradient ascent/descent: ∂τ (u,v )
For each uv, set (with learning rate η):
τ (u,v ) ← τ (u,v ) + η sgn(puser (v ) − pguess (v ))
p(u )
Re-iterate to convergence
∑u '→v p(u' )
åæçè éêêé ëìíîïíðíïñò óé
21
Prototypes and products
DTL DataSpot Mercado Intuifind
www.mercado.com/
EasyAsk www.easyask.com/
ELIXIR www.smi.ucd.ie/elixir/
XIRQL ls6-www.informatik.uni-
dortmund.de/ir/projects/hyrex/
Microsoft DBXplorer
BANKS www.cse.iitb.ac.in/banks/
üýþÿ
Summary
Confluence of structured and free-format,
keyword-based search
Extend SQL, XQuery, Web search, IR
Many useful applications: product catalogs,
software libraries, Web search
Key idiom: proximity in a graph
representation of textual data
Implicit joins on foreign keys
Proximity via IDREF and other links
Several working systems
Not enough consensus on clean models
üýþÿ
22