Clustering Web Search Results
Iwona Biaynicka-Birula
Overview
What is clustering?
Applying clustering to web search results
Clustering algorithms
Case studies
Related topics not covered
Clustering
Clustering in general
Document clustering in general
Other search and browsing aids
Classification
Visualization
Query expansion
Iwona Biaynicka-Birula - Clustering Web Search Results
What is clustering?
Clustering the act of grouping similar
object into sets
In the web search context:
organizing web pages (search results)
into groups, so that different groups
correspond to different user needs
search engine
i.e.: engine
car part
Engine Corp.
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering vs. Classification
Classification assigns objects to
predefined groups
Clustering infers groups based on
clustered objects
Iwona Biaynicka-Birula - Clustering Web Search Results
Why cluster web search results?
Flat ranked list not enough
Relationships between the results
Documents pertaining to different topics
cannot be compared
Cluster Hypothesis (van Rijsbergen 1979):
Closely related documents tend to be relevant
to the same requests.
Aids user-engine interaction
Browsing
Help user express his need
Iwona Biaynicka-Birula - Clustering Web Search Results
Why not just document clustering?
Web search results clustering is a
version of document clustering,
but
Billions of pages
Constantly changing
Data mainly unstructured and
heterogeneous
Additional information to consider
(i.e. links, click-through data, etc.)
Iwona Biaynicka-Birula - Clustering Web Search Results
Some requirements
Fast
Flexible
Immediate response to query
Web content changes constantly
User-oriented
Main goal is to aid the user in finding
sought information
Iwona Biaynicka-Birula - Clustering Web Search Results
Main issues
Online or offline clustering?
What to use as input
How to define similarity?
Entire documents
Snippets
Structure information (links)
Other data (i.e. click-through)
Use stop word lists, stemming, etc.
Content (i.e. vector-space model)
Link analysis
Usage statistics
How to group similar documents?
How to label the groups?
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering algorithms
Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity
measure? Distance measure?
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering algorithms
Distance-based
Hierarchical
Agglomerative Hierarchical Clustering (AHC)
Flat
K-means (can be fuzzy)
Single-pass (incremental)
Other
Suffix Tree Clustering (Grouper)
Self-organizing (Kohonen) maps (neural
networks)
Latent Semantic Indexing (LSI) (reducing the
dimensionality of the vector-space)
Iwona Biaynicka-Birula - Clustering Web Search Results
Agglomerative hierarchical clustering
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering result: dendrogram
Iwona Biaynicka-Birula - Clustering Web Search Results
AHC variants
Various ways of calculating cluster
similarity
single-link
complete-link
(minimum)
(maximum)
Group-average
(average)
Iwona Biaynicka-Birula - Clustering Web Search Results
K-means clustering (k=3)
Iwona Biaynicka-Birula - Clustering Web Search Results
Single-pass
Iwona Biaynicka-Birula - Clustering Web Search Results
Selected systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et. al. 2001)
SHOC
Iwona Biaynicka-Birula - Clustering Web Search Results
Scatter/Gather
(Cutting et. al. 1992)
Designed for browsing
Based on two novel clustering
algorithms
Buckshot fast for online clustering
Fractionation accurate for offline
initial clustering of the entire set
Iwona Biaynicka-Birula - Clustering Web Search Results
Grouper
(Zamir and Etzioni 1997, 1999)
Online
Operates on query result snippets
Clusters together documents with
large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling
Iwona Biaynicka-Birula - Clustering Web Search Results
Suffix Tree Clustering (STC)
Linear
Incremental
Overlapping
Can be extended to hierarchical
Iwona Biaynicka-Birula - Clustering Web Search Results
STC algorithm
Step 1: Cleaning
Step 2: Suffix tree construction
Stemming
Sentence boundary identification
Punctuation elimination
Produces base clusters (internal nodes)
Base clusters are scored based on size and
phrase score (which depends on length and
word quality)
Step 3: Merging base clusters
Highly overlapping clusters are merged
Iwona Biaynicka-Birula - Clustering Web Search Results
Carrot2
(Stefanowski and Weiss 2003)
http://www.cs.put.poznan.pl/dweiss/carr
ot/
Component framework
Allows substituting components for
Input (i.e. snippets from other search engines)
Filter
Stemming
Distance measure
Clustering
Output
Iwona Biaynicka-Birula - Clustering Web Search Results
Vivsimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual
Iwona Biaynicka-Birula - Clustering Web Search Results
Other
Mapuccino (IBM)
(Su et. al. 2001)
(Maarek et. al. 2000)
http://www.alphaworks.ibm.com/tech/mapuccino
Relatively efficient AHC (O(n2))
Similarity based on vector-space model
Only usage statistics used as input
Recursive Density Based Clustering
SHOC
(Zhang and Dong 2004)
Grouper-like
Key phrase discovery
Iwona Biaynicka-Birula - Clustering Web Search Results
References
Douglass Cutting, David Karger, Jan Pedersen, and John W.
Tukey, Scatter/Gather: A Cluster-based Approach to Browsing
Large Document Collections, 1992.
Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
O. Zamir and O. Etzioni, Grouper: a dynamic clustering
interface to web search results, May 1999.
In Proceedings of the Eighth International World Wide Web Conference, Toronto, CanadaM. Steinbach, G.
Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral
document clustering for web applications, 2000.
Technical Report RJ 10186, IBM Research
Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen
Hu, Correlation-based Document Clustering using Web
Logs, 2001.
J. Stefanowski, D. Weiss. Carrot2 and Language Properties in
Web Search Results Clustering, 2003.
In: Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First
International Atlantic Web Intelligence Conference, Madrit, Spain, vol. 2663 (), pp. 240249
Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online
Clustering of Web Search Results, Apr 2004.
In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China
Iwona Biaynicka-Birula - Clustering Web Search Results
Thank you
Questions?
http://www.di.unipi.it/~iwona/Clust
ering.ppt
Iwona Biaynicka-Birula - Clustering Web Search Results