SINGLE PASS
CLUSTERING
ALGORITHM
INFORMATION RETRIEVAL SYSTEM
BTech VII Sem
Code: CC4151
SINGLE PASS CLUSTERING
ALGORITHM
• The single pass method is particularly simple since it requires that the data set be processed only once. The general algorithm is as follows (a minimal Python sketch of these steps appears after the list):
• 1. Assign the first document D1 as the representative for cluster C1.
• 2. For the next document Di, calculate its similarity S with the representative of each existing cluster, and let Smax be the largest of these values.
• 3. If Smax is greater than a threshold value ST, add Di to the corresponding cluster and recalculate that cluster's representative; otherwise, use Di to initiate a new cluster.
• 4. If any document Di remains to be clustered, return to step 2.
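• A minimal sketch of these steps in Python, assuming the dot product as the similarity measure and the mean (average) vector as the cluster representative; both assumptions match the worked example later in these slides.

```python
def single_pass_cluster(vectors, threshold):
    """Cluster item vectors in a single pass.

    vectors   : list of equal-length numeric vectors (one per item)
    threshold : minimum similarity required to join an existing cluster
    Returns (clusters, centroids); each cluster is a list of item indices.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def mean(members):
        return [sum(vectors[i][k] for i in members) / len(members)
                for k in range(len(vectors[0]))]

    clusters, centroids = [], []
    for i, v in enumerate(vectors):
        if not clusters:                        # step 1: first item starts C1
            clusters.append([i])
            centroids.append(list(v))
            continue
        sims = [dot(v, c) for c in centroids]   # step 2: similarity to each representative
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:             # step 3: join the most similar cluster
            clusters[best].append(i)
            centroids[best] = mean(clusters[best])
        else:                                   # otherwise start a new cluster
            clusters.append([i])
            centroids.append(list(v))
    return clusters, centroids                  # step 4 is simply the loop
```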
CONT…
• Though the single pass method has the advantage of simplicity, it is often criticized for its tendency to produce large clusters early in the clustering pass, and for the fact that the clusters formed are not independent of the order in which the data set is processed.
• It is sometimes used to form the groups that are used to initiate reallocation
clustering.
• In this approach, a set of documents is first selected as cluster seeds, and then each document is assigned to the cluster seed that maximally covers it. For a document Di, the cover coefficient is a measure that incorporates both the extent to which Di is covered by another document Dj and the uniqueness of Di, that is, the extent to which it is covered by itself.
EXAMPLE
• Suppose that we have the following set of documents and terms, and that we are interested in clustering the terms using the single pass method. (Note that the same method can be used to cluster the documents, but in that case we would use the document vectors (rows) rather than the term vectors (columns).)

        T1  T2  T3  T4  T5
  D1     1   2   0   0   1
  D2     3   1   2   3   0
  D3     3   0   0   0   1
  D4     2   1   0   3   0
  D5     2   2   1   5   1
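• The same matrix in Python, as a quick sanity check that the term vectors used in the solution below are simply the columns of the document–term matrix (clustering documents instead would use the rows):

```python
# Document-term matrix from the example: rows are D1..D5, columns are T1..T5.
M = [
    [1, 2, 0, 0, 1],   # D1
    [3, 1, 2, 3, 0],   # D2
    [3, 0, 0, 0, 1],   # D3
    [2, 1, 0, 3, 0],   # D4
    [2, 2, 1, 5, 1],   # D5
]

# To cluster terms, take the columns as the item vectors.
term_vectors = [list(col) for col in zip(*M)]
print(term_vectors[0])   # T1 = [1, 3, 3, 2, 2]
```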
SOLUTION
• Assume that our threshold is 10
• Start with T1 in a cluster by itself, say C1. At this point, C1
contains only one item, T1, so the centroid of C1 is simply the
vector for T1:
• C1 = <1, 3, 3, 2, 2>.
• Now compare the next item (T2) to the centroids of all existing clusters (i.e., measure their similarities). At this point we have only one cluster, C1 (we will use the dot product for simplicity):
• SIM(T2, C1) = <2, 1, 0, 1, 2> . <1, 3, 3, 2, 2> = 2*1 + 1*3 + 0*3 + 1*2 + 2*2 = 11
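• The same comparison as a one-line check in Python (the vectors are taken from the example):

```python
# Similarity of T2 to the current centroid of C1, using the dot product.
T2 = [2, 1, 0, 1, 2]
C1 = [1, 3, 3, 2, 2]
print(sum(a * b for a, b in zip(T2, C1)))   # 11
```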
SOLUTION CONT…..
• Recall that our pre-specified similarity threshold is 10. This means that if the similarity of T2 to the cluster centroid is >= 10, we add T2 to the cluster; otherwise, we use T2 to start a new cluster.
• In this case, SIM(T2, C1) = 11 > 10. Therefore, we add T2 to cluster C1.
• We now need to compute the new centroid for C1 (which now contains T1 and T2). The centroid, which is the average vector of T1 and T2, is:
• C1 = <3/2, 4/2, 3/2, 3/2, 4/2>
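• Updating the representative is just a component-wise average; a quick check in Python:

```python
# New centroid of C1 = component-wise mean of T1 and T2.
T1 = [1, 3, 3, 2, 2]
T2 = [2, 1, 0, 1, 2]
print([(a + b) / 2 for a, b in zip(T1, T2)])   # [1.5, 2.0, 1.5, 1.5, 2.0]
```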
SOLUTION CONT…..
• Now, we move to the next item, T3. Again, there is only one cluster, C1, so we only need to compare T3 with the centroid of C1. The dot product of T3 = <0, 2, 0, 0, 1> and the above centroid is:
• SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6
• This time, T3 does not pass the threshold test (the similarity is less than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two clusters:
• C1 = {T1, T2} = <3/2, 4/2, 3/2, 3/2, 4/2>
• C2 = {T3} = <0, 2, 0, 0, 1>
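• The same check for T3 in Python: its similarity to the only existing centroid falls below the threshold, so it starts a new cluster.

```python
T3 = [0, 2, 0, 0, 1]
centroid_C1 = [1.5, 2.0, 1.5, 1.5, 2.0]
sim = sum(a * b for a, b in zip(T3, centroid_C1))
print(sim)   # 6.0 -> below the threshold of 10, so T3 starts C2
```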
SOLUTION CONT…..
• We move to the next unclustered item, T4. Since we now have two clusters, we need to compute the similarity of T4 to both cluster centroids and take the MAX.
• (Note that the centroid of cluster C2 right now is just the vector for T3):
• SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
• = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5
• SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
• = 0 + 6 + 0 + 0 + 5 = 11
SOLUTION CONT…..
• Note that both similarity scores pass the threshold (10); however, we pick the MAX, and therefore T4 is added to cluster C1. Now we have the following:
• C1 = {T1, T2, T4}
• C2 = {T3}
• The centroid for C2 is still just the vector for T3:
• C2 = <0, 2, 0, 0, 1>
• and the new centroid for C1 is now:
• C1 = <3/3, 7/3, 3/3, 6/3, 9/3>
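• A small check of the T4 step in Python: compare T4 against both centroids, join the cluster with the maximum similarity, and recompute that cluster's centroid (values taken from the example above).

```python
T1, T2, T4 = [1, 3, 3, 2, 2], [2, 1, 0, 1, 2], [0, 3, 0, 3, 5]
centroids = {"C1": [1.5, 2.0, 1.5, 1.5, 2.0], "C2": [0, 2, 0, 0, 1]}

# Similarity of T4 to each centroid; T4 joins the cluster with the maximum.
sims = {name: sum(a * b for a, b in zip(T4, c)) for name, c in centroids.items()}
print(sims)                      # {'C1': 20.5, 'C2': 11}
print(max(sims, key=sims.get))   # 'C1' -> T4 joins C1

# Recompute the centroid of C1 as the mean of T1, T2 and T4.
print([(a + b + c) / 3 for a, b, c in zip(T1, T2, T4)])
# [1.0, 2.333..., 1.0, 2.0, 3.0], i.e. <3/3, 7/3, 3/3, 6/3, 9/3>
```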
SOLUTION CONT…..
• The only item left unclustered is T5. We compute its similarity to the centroids of
existing clusters:
• SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
• = 3/3 + 0 + 3/3 + 0 + 9/3 = 5
• SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
• = 0 + 0 + 0 + 0 +1 = 1
• Neither of these similarity values passes the threshold. Therefore, T5 will have to go into a new cluster, C3. There are no more unclustered items, so we are done (after making a single pass through the items). The final clusters are:
SOLUTION CONT…..
• C1 = {T1, T2, T4}
• C2 = {T3}
• C3 = {T5}
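• As a check, running the single_pass_cluster sketch given after the algorithm steps (an illustrative helper introduced in that sketch, not part of the original slides) on the five term vectors with threshold 10 reproduces the same three clusters:

```python
term_vectors = [
    [1, 3, 3, 2, 2],   # T1
    [2, 1, 0, 1, 2],   # T2
    [0, 2, 0, 0, 1],   # T3
    [0, 3, 0, 3, 5],   # T4
    [1, 0, 1, 0, 1],   # T5
]
clusters, centroids = single_pass_cluster(term_vectors, threshold=10)
print(clusters)   # [[0, 1, 3], [2], [4]] -> C1 = {T1, T2, T4}, C2 = {T3}, C3 = {T5}
```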
DENDROGRAM
• A dendrogram is a diagram that shows the hierarchical
relationship between objects.
• It is mostly created as an output from hierarchical clustering.
• The main use of a dendrogram is to work out the best way to
allocate objects to clusters.
• As an example, a dendrogram can show the hierarchical clustering of six observations plotted on a scatterplot (as in the sketch below).
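• A minimal sketch of producing such a dendrogram with SciPy's hierarchical clustering; the six 2-D points below are illustrative assumptions, not data from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six illustrative 2-D observations (assumed for the sketch).
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
                   [5.0, 5.5], [9.0, 1.0], [9.0, 0.5]])

Z = linkage(points, method="single")   # single-link hierarchical clustering
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.xlabel("observation")
plt.ylabel("merge distance")
plt.show()
```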
EXAMPLE
• C1 = {T1, T2, T4}
• C2 = {T3}
• C3 = {T5}
(Dendrogram of the clusters above, with leaves in the order T1, T2, T4, T3, T5)
EXAMPLE
Suppose that we have the following set of documents and terms. We are interested in clustering the terms by measuring their similarity. Assume the pre-specified similarity threshold is 10. Apply the single pass clustering algorithm to the following data to construct the clusters.