FLAT CLUSTERING &
HIERARCHICAL
CLUSTERING in I.R.
By:
Suraj Jogani (20117101)
Pankaj Agarwal (20117068)
Aditya Kumar Dubey(20117901)
Electrical Engineering
6th Semester
What is clustering?
Grouping set of documents into subsets or clusters.
The Goal of clustering algorithm is: To create clusters that are coherent internally,
but clearly different from each other
Documents within a cluster should be as similar as possible; and
Documents in one cluster should be as dissimilar as possible from documents in
other clusters
Clustering algorithms
Flat algorithm
Usually start with a random (partial) partitioning
Refine it iteratively by changing the centroid
K-Means clustering
Model based clustering
Hierarchical algorithm
Hierarchical algorithms are algorithms where you also have the explicit notion of a
hierarchy
In hierarchical we can cluster our documents into certain no of clusters and then we can
group together those clusters in turn into larger clusters and so on to finally have a
hierarchy.
Bottom up ,agglomerative
Top down ,Divisive
Hard vs soft clustering
Another way to classify clustering is:
Hard clustering: Each documents belong to exactly one
cluster.
Soft clustering: A document can belongs to more than one
cluster.
Ex—You may want to put a pair of sneakers in two clusters (i)
sports apparels (ii) shoes
Soft clustering is not used only hard clustering is used today.
K-Means
K-Means is the important flat clustering algorithm.
Its objective is to minimize the average squared Euclidean distance of
documents from their cluster centers .
where a cluster center is defined as the mean or centroid.
The first step of K-Means and our goal is to select seeds- is a initial cluster
centers K randomly selected documents.
The algorithm then moves the cluster centers around in space to minimize
RSS(residual sum of squares)
The square distance of each vector from its centroid summed over all the
vectors.
A k-means example of k=2
Hierarchy clustering
Hierarchical clustering outputs a hierarchy ,
a structure that is more informative than the unstructured set
of clusters returned by flat clustering.
Hierarchical clustering does not require us to prespecify the
number of clusters and most hierarchical algorithm are
deterministic.
Hierarchical clustering produce better result than flat
clustering
Hierarchical Agglomerative
clustering
Hierarchical clustering algorithm are either top-down or bottom-up.
Bottom-up algorithms treat each document as a singleton clusters at the outset and then successively merge pairs of
clusters until all clusters have been merged into a single cluster that contains all documents.
Bottom up hierarchical clustering is therefore called hierarchical agglomerative clustering(HAC)
Agglomerative hierarchical clustering presents four different agglomerative algorithms.
1.Single link
2.Complete link
3.Group-average
4.Centroid similarity.
Single-link and complete-link clustering
Single link clustering: In single link or single linkage clustering ,the similarity of two
clusters is the similarity of their most similar members
Complete link clustering: In complete link clustering or complete linkage clustering, the
similarity of two clusters is the similarity of their most dissimilar members.
Divisive clustering
This variant of hierarchical clustering is called top-down
clustering or divisive clustering.
We start at the top with all documents in one cluster. the
cluster is split using a flat clustering algorithm.
This procedure is applied recursively until each document is
in its own singleton cluster.
Why cluster documents?
• Better user interface
• Better search results
• Effective “user recall”(relevant
documents retrieved) will be
higher
• Faster search
Some application of clustering
in Information Retrieval
Application What is clustered? Benefit
Search result clustering search more effective information
results presentation to user
Scatter-Gather (subsets of) alternative user interface:
collection “search without typing”
Collection clustering collection effective information
presentation
for exploratory
browsing
Language modeling collection increased precision and/or
recall
Cluster-based retrieval collection higher efficiency: faster
search
Thank you