LAB Manual
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.08
A.1 Aim:
Implementation of agglomerative hierarchical clustering in a programming language such as
Java, C++ or Python, or using the WEKA tool.
A.2 Prerequisite:
Familiarity with the WEKA tool and programming languages.
A.3 Outcome:
After successful completion of this experiment students will be able to
➢ Use classification and clustering algorithms of data mining.
A.4 Theory:
Hierarchical Clustering:-
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
One approach: recursive application of a partitional clustering algorithm.
Dendrogram: Hierarchical Clustering
• A clustering is obtained by cutting the dendrogram at a desired level: each connected
component forms a cluster.
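The cut described above can be sketched in Python. This is a minimal illustration, not part of WEKA: it assumes the dendrogram is stored as a list of merges (distance, a, b), and the name cut_dendrogram is hypothetical. Merges at or below the cut level are kept, and the connected components that remain are the clusters.

```python
def cut_dendrogram(n_items, merges, cut_level):
    """Return the clusters obtained by cutting a dendrogram at cut_level.

    merges is a list of (distance, a, b): the clusters containing items
    a and b were merged at that distance (assumed representation).
    """
    parent = list(range(n_items))  # union-find forest, one tree per cluster

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for distance, a, b in merges:
        if distance <= cut_level:  # this merge lies below the cut: keep it
            parent[find(a)] = find(b)

    # Items sharing a root belong to the same connected component.
    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


# Five items; merges listed from the bottom of the tree to the top.
merges = [(1.0, 0, 1), (1.5, 3, 4), (3.0, 1, 2), (6.0, 2, 4)]
low_cut = cut_dendrogram(5, merges, cut_level=2.0)
high_cut = cut_dendrogram(5, merges, cut_level=4.0)
print(low_cut)   # [[0, 1], [2], [3, 4]]
print(high_cut)  # [[0, 1, 2], [3, 4]]
```

A lower cut level gives more, smaller clusters; a higher one gives fewer, larger clusters.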
Hierarchical Clustering algorithms:-
Agglomerative (bottom-up):
1. Start with each document being a single cluster.
2. Repeatedly merge the two closest clusters.
3. Eventually all documents belong to the same cluster.
Divisive (top-down):
1. Start with all documents belonging to the same cluster.
2. Repeatedly split a cluster into smaller clusters.
3. Eventually each document forms a cluster on its own.
Both approaches:
1. Do not require the number of clusters k in advance.
2. Need a termination/read-out condition.
3. The final state in both agglomerative and divisive clustering (a single all-inclusive cluster, or all singletons) is of no use on its own.
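The bottom-up procedure can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the WEKA implementation: points are numeric tuples, distance is Euclidean, clusters are compared by single-link, and the termination condition is a target number of clusters. The function names are hypothetical.

```python
import math


def single_link_distance(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, target_clusters):
    """Naive bottom-up clustering; stops when target_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merge_history = []
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters (assumed: single-link, Euclidean).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_history.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, merge_history


points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.5), (9.0, 1.0)]
clusters, history = agglomerative(points, target_clusters=3)
print(clusters)
```

With target_clusters=1 the loop runs until everything is merged, which is why a read-out condition (here, the target count) is needed to obtain a useful clustering. Recomputing all pairwise distances on every pass makes this sketch much slower than practical implementations.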
Many variants for defining the closest pair of clusters:-
Single-link: Similarity of the most cosine-similar pair of points, one from each cluster
Complete-link: Similarity of the “furthest” points, i.e. the least cosine-similar pair
Centroid: Clusters whose centroids (centers of gravity) are the most cosine-similar
Average-link: Average cosine similarity between pairs of elements
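The four criteria can be compared with a short Python sketch. The function names are hypothetical, vectors are assumed to be non-zero numeric tuples, and average-link is assumed here to average over cross-cluster pairs only (some definitions also include pairs within each cluster).

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def single_link(A, B):
    # Most similar pair, one point from each cluster.
    return max(cosine(u, v) for u, v in product(A, B))


def complete_link(A, B):
    # Least similar ("furthest") pair, one point from each cluster.
    return min(cosine(u, v) for u, v in product(A, B))


def centroid(cluster):
    # Component-wise mean of the cluster's vectors.
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def centroid_link(A, B):
    return cosine(centroid(A), centroid(B))


def average_link(A, B):
    # Mean similarity over all cross-cluster pairs (assumed definition).
    sims = [cosine(u, v) for u, v in product(A, B)]
    return sum(sims) / len(sims)


A = [(1.0, 0.0), (1.0, 1.0)]
B = [(0.0, 1.0), (1.0, 2.0)]
for name, fn in [("single", single_link), ("complete", complete_link),
                 ("centroid", centroid_link), ("average", average_link)]:
    print(f"{name:9s} {fn(A, B):.4f}")
```

In each variant, the pair of clusters with the highest similarity score is merged next. Since complete-link can never exceed average-link, and average-link can never exceed single-link, the choice of criterion can change which pair is merged.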
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the
practical. The soft copy must be uploaded on Blackboard, or emailed to the concerned
lab in-charge faculty at the end of the practical in case there is no Blackboard access
available.)
Roll. No. Name:
Class: Batch:
Date of Experiment: Date of Submission:
Grade:
B.1 Software Code written by student:
@relation bank_customers
@attribute age numeric
@attribute job {manager, developer, technician, analyst, retired, accountant}
@attribute qualification {tertiary, primary, secondary}
@attribute communication_type {cellular, telephonic}
@attribute acc_balance numeric
@attribute marital_status {married, unmarried}
@data
32,manager,tertiary,cellular,30000,married
30,developer,secondary,telephonic,28000,married
40,manager,tertiary,cellular,40000,unmarried
70,retired,secondary,telephonic,52000,married
54,analyst,primary,cellular,32000,married
58,manager,tertiary,telephonic,60000,married
44,technician,secondary,cellular,40000,unmarried
35,manager,tertiary,telephonic,55000,married
42,technician,primary,telephonic,35000,married
28,accountant,secondary,cellular,32000,married
50,manager,tertiary,telephonic,70000,married
51,retired,primary,cellular,44000,unmarried
64,retired,secondary,cellular,30000,married
90,retired,tertiary,telephonic,85000,married
76,analyst,secondary,cellular,80000,unmarried
79,accountant,secondary,telephonic,72500,married
30,developer,primary,telephonic,50000,married
42,developer,secondary,telephonic,55000,married
B.2 Input and Output:
B.3 Observations and learning:
We observed how the data is preprocessed and how clustering is performed.
B.4 Conclusion:
We successfully implemented agglomerative hierarchical clustering using the WEKA tool.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and learning/observations)
Q1: Explain the advantages and disadvantages of agglomerative hierarchical
clustering.
Ans:
Advantages
1) No a priori information about the number of clusters is required.
2) Easy to implement, and gives the best results in some cases.
Disadvantages
1) The algorithm can never undo a merge made earlier.
2) Time complexity is high: O(n^3) for the naive algorithm and O(n^2 log n) with a priority queue, where ‘n’ is the number of data points.
3) Depending on the distance measure chosen for merging, the algorithm can suffer
from one or more of the following:
i) Sensitivity to noise and outliers
ii) Breaking large clusters
iii) Difficulty handling different sized clusters and convex shapes
4) No objective function is directly minimized.
5) Sometimes it is difficult to identify the correct number of clusters from the dendrogram.
Q2: What is the relationship between top-down, bottom-up and divisive/agglomerative
clustering?
Ans:
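Top-down and bottom-up describe the direction in which the cluster hierarchy is built.
Divisive clustering is the top-down approach: it starts with all documents in a single
cluster and splits clusters repeatedly until each document forms a cluster on its own.
Agglomerative clustering is the bottom-up approach: it starts with each document as its
own cluster and merges the closest clusters repeatedly until all documents belong to the
same cluster.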