SINGLE PASS
CLUSTERING
ALGORITHM
INFORMATION RETRIEVAL SYSTEM
BTech VII Sem
Code: CC4151
SINGLE PASS CLUSTERING
ALGORITHM
• The single pass method is particularly simple since it requires that the data set be processed only once. The general algorithm is as follows (a minimal Python sketch of these steps appears after the list):
• 1. Assign the first document D1 as the representative for cluster C1.
• 2. For the next document Di, calculate its similarity S with the representative of each existing cluster, and let Smax be the largest of these values.
• 3. If Smax is greater than a threshold value ST, add Di to the corresponding cluster and recalculate that cluster's representative; otherwise, use Di to initiate a new cluster.
• 4. If any document Di remains to be clustered, return to step 2.
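• A minimal sketch of these steps in Python, assuming the dot product as the similarity measure and the mean (average) vector as the cluster representative; both assumptions match the worked example later in these slides.

```python
def single_pass_cluster(vectors, threshold):
    """Cluster item vectors in a single pass.

    vectors   : list of equal-length numeric vectors (one per item)
    threshold : minimum similarity required to join an existing cluster
    Returns (clusters, centroids); each cluster is a list of item indices.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def mean(members):
        return [sum(vectors[i][k] for i in members) / len(members)
                for k in range(len(vectors[0]))]

    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        if not clusters:                        # step 1: first item starts C1
            clusters.append([i])
            centroids.append(list(v))
            continue
        sims = [dot(v, c) for c in centroids]   # step 2: similarity to each representative
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:             # step 3: join the most similar cluster
            clusters[best].append(i)
            centroids[best] = mean(clusters[best])
        else:                                   # otherwise start a new cluster
            clusters.append([i])
            centroids.append(list(v))
    return clusters, centroids                  # step 4 is simply the loop
```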
CONT…
• Though the single pass method has the advantage of simplicity, it is often criticized for its tendency to produce large clusters early in the clustering pass, and for the fact that the clusters formed are not independent of the order in which the data set is processed.
• It is sometimes used to form the groups that are used to initiate reallocation
clustering.
• In this approach, a set of documents is first selected as cluster seeds, and then each document is assigned to the cluster seed that maximally covers it. For a document Di, the cover coefficient is a measure that incorporates both the extent to which Di is covered by another document Dj and the uniqueness of Di, that is, the extent to which it is covered by itself.
EXAMPLE
• Suppose that we have the following set of documents and terms, and that we are interested in clustering the terms using the single pass method. (Note that the same method can be used to cluster the documents, but in that case we would use the document vectors (rows) rather than the term vectors (columns).)

        T1  T2  T3  T4  T5
  D1     1   2   0   0   1
  D2     3   1   2   3   0
  D3     3   0   0   0   1
  D4     2   1   0   3   0
  D5     2   2   1   5   1
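• The same matrix in Python, as a quick sanity check that the term vectors used in the solution below are simply the columns of the document–term matrix (clustering documents instead would use the rows):

```python
# Document-term matrix from the example: rows are D1..D5, columns are T1..T5.
M = [
    [1, 2, 0, 0, 1],   # D1
    [3, 1, 2, 3, 0],   # D2
    [3, 0, 0, 0, 1],   # D3
    [2, 1, 0, 3, 0],   # D4
    [2, 2, 1, 5, 1],   # D5
]

# To cluster terms, take the columns as the item vectors.
term_vectors = [list(col) for col in zip(*M)]
print(term_vectors[0])   # T1 = [1, 3, 3, 2, 2]
```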
SOLUTION
• Assume that our threshold is 10
• Start with T1 in a cluster by itself, say C1. At this point, C1
contains only one item, T1, so the centroid of C1 is simply the
vector for T1:
• C1 = <1, 3, 3, 2, 2>.
• Now compare the next item (T2) to the centroids of all existing clusters (i.e., measure their similarities). At this point we have only one cluster, C1 (we will use the dot product for simplicity):
• SIM(T2, C1) = <2, 1, 0, 1, 2> . <1, 3, 3, 2, 2> = 2*1 + 1*3 + 0*3 + 1*2 + 2*2 = 11
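• The same comparison as a one-line check in Python (the vectors are taken from the example):

```python
# Similarity of T2 to the current centroid of C1, using the dot product.
T2 = [2, 1, 0, 1, 2]
C1 = [1, 3, 3, 2, 2]
print(sum(a * b for a, b in zip(T2, C1)))   # 11
```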
SOLUTION CONT…..
• Recall that our pre-specified similarity threshold is 10. This means that if the similarity of T2 to the cluster centroid is >= 10, we add T2 to the cluster; otherwise, we use T2 to start a new cluster.
• In this case, SIM(T2, C1) = 11 > 10. Therefore, we add T2 to cluster C1.
• We now need to compute the new centroid for C1 (which now contains T1 and T2). The centroid, which is the average vector of T1 and T2, is:
• C1 = <3/2, 4/2, 3/2, 3/2, 4/2>
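• Updating the representative is just a component-wise average; a quick check in Python:

```python
# New centroid of C1 = component-wise mean of T1 and T2.
T1 = [1, 3, 3, 2, 2]
T2 = [2, 1, 0, 1, 2]
print([(a + b) / 2 for a, b in zip(T1, T2)])   # [1.5, 2.0, 1.5, 1.5, 2.0]
```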
SOLUTION CONT…..
• Now, we move to the next item, T3. Again, there is only one cluster, C1, so we only need to compare T3 with the centroid of C1. The dot product of T3 = <0, 2, 0, 0, 1> and the above centroid is:
• SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6
• This time, T3 does not pass the threshold test (the similarity is less than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two clusters:
• C1 = {T1, T2} = <3/2, 4/2, 3/2, 3/2, 4/2>
• C2 = {T3} = <0, 2, 0, 0, 1>
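• The same check for T3 in Python: its similarity to the only existing centroid falls below the threshold, so it starts a new cluster.

```python
T3 = [0, 2, 0, 0, 1]
centroid_C1 = [1.5, 2.0, 1.5, 1.5, 2.0]
sim = sum(a * b for a, b in zip(T3, centroid_C1))
print(sim)   # 6.0 -> below the threshold of 10, so T3 starts C2
```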
SOLUTION CONT…..
• We move to the next unclustered item, T4. Since we now have two clusters, we need to compute the similarity of T4 to both cluster centroids and take the MAX.
• (Note that the centroid of cluster C2 right now is just the vector for T3):
• SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
• = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5
• SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
• = 0 + 6 + 0 + 0 + 5 = 11
SOLUTION CONT…..
• Note that both similarity scores pass the threshold (10); however, we pick the MAX, and therefore T4 is added to cluster C1. Now we have the following:
• C1 = {T1, T2, T4}
• C2 = {T3}
• The centroid for C2 is still just the vector for T3:
• C2 = <0, 2, 0, 0, 1>
• and the new centroid for C1 is now:
• C1 = <3/3, 7/3, 3/3, 6/3, 9/3>
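• A small check of the T4 step in Python: compare T4 against both centroids, join the cluster with the maximum similarity, and recompute that cluster's centroid (values taken from the example above).

```python
T1, T2, T4 = [1, 3, 3, 2, 2], [2, 1, 0, 1, 2], [0, 3, 0, 3, 5]
centroids = {"C1": [1.5, 2.0, 1.5, 1.5, 2.0], "C2": [0, 2, 0, 0, 1]}

# Similarity of T4 to each centroid; T4 joins the cluster with the maximum.
sims = {name: sum(a * b for a, b in zip(T4, c)) for name, c in centroids.items()}
print(sims)                      # {'C1': 20.5, 'C2': 11}
print(max(sims, key=sims.get))   # 'C1' -> T4 joins C1

# Recompute the centroid of C1 as the mean of T1, T2 and T4.
print([(a + b + c) / 3 for a, b, c in zip(T1, T2, T4)])
# [1.0, 2.333..., 1.0, 2.0, 3.0], i.e. <3/3, 7/3, 3/3, 6/3, 9/3>
```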
SOLUTION CONT…..
• The only item left unclustered is T5. We compute its similarity to the centroids of
existing clusters:
• SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
• = 3/3 + 0 + 3/3 + 0 + 9/3 = 5
• SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
• = 0 + 0 + 0 + 0 +1 = 1
• Neither of these similarity values passes the threshold. Therefore, T5 will have to go into a new cluster, C3. There are no more unclustered items, so we are done (after making a single pass through the items). The final clusters are:
SOLUTION CONT…..
• C1 = {T1, T2, T4}
• C2 = {T3}
• C3 = {T5}
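• As a check, running the single_pass_cluster sketch given after the algorithm steps (an illustrative helper introduced in that sketch, not part of the original slides) on the five term vectors with threshold 10 reproduces the same three clusters:

```python
term_vectors = [
    [1, 3, 3, 2, 2],   # T1
    [2, 1, 0, 1, 2],   # T2
    [0, 2, 0, 0, 1],   # T3
    [0, 3, 0, 3, 5],   # T4
    [1, 0, 1, 0, 1],   # T5
]
clusters, centroids = single_pass_cluster(term_vectors, threshold=10)
print(clusters)   # [[0, 1, 3], [2], [4]] -> C1 = {T1, T2, T4}, C2 = {T3}, C3 = {T5}
```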
DENDROGRAM
• A dendrogram is a diagram that shows the hierarchical
relationship between objects.
• It is mostly created as an output from hierarchical clustering.
• The main use of a dendrogram is to work out the best way to
allocate objects to clusters.
• As an example, a dendrogram can show the hierarchical clustering of six observations plotted on a scatterplot (as in the sketch below).
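• A minimal sketch of producing such a dendrogram with SciPy's hierarchical clustering; the six 2-D points below are illustrative assumptions, not data from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six illustrative 2-D observations (assumed for the sketch).
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
                   [5.0, 5.5], [9.0, 1.0], [9.0, 0.5]])

Z = linkage(points, method="single")   # single-link hierarchical clustering
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.xlabel("observation")
plt.ylabel("merge distance")
plt.show()
```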
EXAMPLE
• C1 = {T1, T2, T4}
• C2 = {T3}
• C3 = {T5}
(Dendrogram of the clusters above, with leaves in the order T1, T2, T4, T3, T5)
EXAMPLE
Suppose that we have the following set of documents and terms. We are interested in clustering the terms by measuring their similarity. Assume the pre-specified similarity threshold is 10. Apply the single pass clustering algorithm to the following data to construct the clusters.