Module 4
1. Introduction to Clustering
● Clustering is a data mining technique used to group a set of objects into clusters such
that objects within the same cluster are more similar to each other than to objects in
other clusters.
2. Cluster
● A cluster is a collection of data objects that are similar to one another within the same
group and dissimilar to objects in other groups.
3. Cluster Analysis
● Cluster Analysis = The process of grouping a set of objects into clusters.
● Typical requirements of a clustering algorithm include (among others):
○ Ability to deal with different data types – numerical, categorical, mixed.
○ Minimal domain knowledge required – should not rely heavily on user input.
Data Structures Used in Cluster Analysis
a) Data Matrix
● Structure: n×p
● n = number of objects, p = number of variables/features
● Example (2 objects, 3 attributes):
Object Attribute 1 Attribute 2 Attribute 3
O1 25 30K 65
O2 40 70K 40
b) Dissimilarity Matrix
● Structure: n×n
● Each entry d(i,j) represents the distance or dissimilarity between object i and j.
● Example:
O1 O2 O3
O1 0 5 3
O2 5 0 4
O3 3 4 0
● Measures: Euclidean, Manhattan, Cosine similarity, Jaccard, etc.
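As an illustration of how a dissimilarity matrix is computed from a data matrix, here is a minimal Python sketch (the three objects, two numeric attributes, and the Euclidean measure are assumptions chosen only for illustration):

import numpy as np
from scipy.spatial.distance import cdist

# Data matrix: n = 3 objects (rows) x p = 2 numeric attributes (columns)
X = np.array([[1.0, 2.0],
              [4.0, 6.0],
              [1.0, 5.0]])

# Dissimilarity matrix: n x n, entry d(i, j) = Euclidean distance between objects i and j
D = cdist(X, X, metric='euclidean')
print(np.round(D, 2))   # diagonal is 0 and the matrix is symmetric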
Clustering Methods
Clustering methods are broadly divided into categories depending on how they form clusters.
1. Partitioning Methods
● Divide the dataset into k clusters (user specifies k in advance).
Examples:
● K-Means: each cluster is represented by the mean (centroid) of its points.
● K-Medoids (PAM): each cluster is represented by an actual data point (the medoid).
2. Hierarchical Methods
● Create a hierarchy (tree-like structure) of clusters → Dendrogram.
● Two approaches:
○ Agglomerative (Bottom-up): Start with each data point as its own cluster and merge the closest clusters step by step.
○ Divisive (Top-down): Start with one big cluster and recursively split into smaller clusters.
Examples:
3. Density-Based Methods
● Clusters = dense regions of objects separated by low-density regions.
Examples: DBSCAN (covered in detail below).
4. Grid-Based Methods
● Divide data space into a finite number of grid cells.
● Fast because processing depends on number of cells, not number of data points.
Examples:
5. Model-Based Methods
● Assume data is generated by a model, and clustering tries to best fit the data to that
model.
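One widely used model-based approach is to assume the data comes from a mixture of Gaussians and fit it with the EM algorithm; the sketch below uses scikit-learn's GaussianMixture on an assumed toy dataset (the data and the choice of 2 components are for illustration only):

import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed toy dataset: two clumps of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# Fit a 2-component Gaussian mixture; each point is assigned to the
# component (model) that best explains it
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)   # cluster label for each point
print(gmm.means_)         # estimated component means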
6. Constraint-Based Methods
● Clustering performed under certain constraints (user-specified conditions like must-link
or cannot-link between objects).
K-Means Clustering
1. Introduction
● K-Means is a partitioning algorithm that divides the dataset into k clusters, each represented by the mean (centroid) of its points.
● Objective: Minimize the sum of squared distances between data points and their assigned cluster centroid.
2. Algorithm Steps
1. Choose the number of clusters k.
2. Initialization: Select k initial centroids (e.g., k random data points).
3. Assignment step: Assign each data point to the nearest centroid (using a distance measure, e.g., Euclidean).
4. Update step: Recalculate centroids as the mean of all points assigned to each cluster.
5. Repeat steps 3 and 4 until the centroids no longer change (convergence).
3. Example
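As a small example, the K-Means loop can be run with scikit-learn; the toy 2-D dataset and the choice k = 2 below are assumptions made only to illustrate the assignment/update steps:

import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: six 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# k must be specified in advance; KMeans repeats the assignment and
# update steps internally until the centroids stop changing
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # final centroids (mean of each cluster)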
5. Advantages of K-Means
● Converges quickly.
6. Disadvantages of K-Means
● Must specify k in advance.
7. Applications of K-Means
● Market segmentation (group customers).
● Pattern recognition.
K-Medoids Clustering
● K-Medoids is a partitioning clustering algorithm similar to K-Means, but instead of
using the mean of data points as a cluster center, it uses medoids (actual data points).
● A medoid is the most centrally located data point in a cluster (i.e., the one with minimum
dissimilarity to all others in that cluster).
This makes K-Medoids more robust to noise and outliers compared to K-Means.
Key Characteristics:
1. Input: Number of clusters k.
3. Distance measure: Usually Euclidean distance, but can use any dissimilarity measure.
4. Output: Partition of the dataset into k clusters + a set of k medoids.
Algorithm Steps (PAM):
1. Initialization: Randomly select k data points as the initial medoids.
2. Assignment step: Assign each data point to the nearest medoid.
3. Update step: For each cluster, find the most centrally located object (the medoid).
4. Swap step: Try replacing a medoid with a non-medoid point and check if total cost (sum
of distances of points to their medoid) decreases.
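A rough NumPy sketch of the assign/update idea (a simplified k-medoids iteration on an assumed toy dataset, not the full PAM swap search):

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)         # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assignment step
        new_medoids = medoids.copy()
        for c in range(k):                                      # update step: within each
            members = np.where(labels == c)[0]                  # cluster, pick the point with
            costs = D[np.ix_(members, members)].sum(axis=1)     # minimum total distance
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):                # stop when medoids stabilize
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids], labels)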
Hierarchical Clustering
● Two approaches:
1. Agglomerative (Bottom–Up) → Start with each data point as its own cluster,
then merge them step by step.
2. Divisive (Top–Down) → Start with all data points in one cluster, then split them.
● Stop (agglomerative): When only one cluster remains (or the desired number of clusters is reached).
📌 Algorithm Steps
1. Initialize: Treat each data point as a cluster.
2. Compute Distance Matrix: Calculate pairwise distances between all clusters.
3. Find Closest Clusters: Identify two clusters with minimum distance.
4. Merge: Combine the two closest clusters into a single cluster.
5. Update Distance Matrix: Recalculate distances between the new cluster and the remaining clusters.
6. Repeat: Continue until only one cluster remains or until a stopping condition (e.g., k
clusters).
Linkage Criteria (how the distance between two clusters is measured)
1. Single Linkage (MIN) → distance = shortest distance between any two points in different clusters.
2. Complete Linkage (MAX) → distance = farthest distance between any two points in different clusters.
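A short sketch of agglomerative clustering with SciPy (the toy points, the complete-linkage choice, and cutting the dendrogram at k = 2 are assumptions for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed toy data: six 2-D points
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)

# Build the dendrogram bottom-up using complete linkage (MAX distance)
Z = linkage(X, method='complete', metric='euclidean')

# Cut the tree so that 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g., [1 1 1 2 2 2]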
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups together points that are closely packed, while marking points that lie in low-density regions as outliers (noise).
Points Table:
Point X Y
P1 4.5 8
P2 5 7
P3 6 6.5
P4 7 5
P5 9 4
P6 7 3
P7 8 3.5
P8 9 5
P9 4 4
P10 3 7.5
P11 4 6
P12 3.5 5
Parameters:
● ε (epsilon) = 1.9
● MinPts = 4
Step 1: Distance Matrix
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
P1 0.00 1.12 2.12 3.91 6.02 5.59 5.70 5.41 4.03 1.58 2.06 3.20
P2 1.12 0.00 1.12 2.83 5.00 4.47 4.61 4.47 3.16 2.06 1.41 2.50
P3 2.12 1.12 0.00 1.80 3.91 3.64 3.61 3.35 3.20 3.16 2.06 2.92
P4 3.91 2.83 1.80 0.00 2.24 2.00 1.80 2.00 3.16 4.72 3.16 2.92
P5 6.02 5.00 3.91 2.24 0.00 2.24 1.12 1.00 5.00 6.95 5.39 5.59
P6 5.59 4.47 3.64 2.00 2.24 0.00 1.12 2.83 3.16 6.02 4.24 4.03
P7 5.70 4.61 3.61 1.80 1.12 1.12 0.00 1.80 4.03 6.50 4.72 4.74
P8 5.41 4.47 3.35 2.00 1.00 2.83 1.80 0.00 5.10 6.30 5.10 5.12
P9 4.03 3.16 3.20 3.16 5.00 3.16 4.03 5.10 0.00 3.64 2.00 1.12
P10 1.58 2.06 3.16 4.72 6.95 6.02 6.50 6.30 3.64 0.00 1.80 2.55
P11 2.06 1.41 2.06 3.16 5.39 4.24 4.72 5.10 2.00 1.80 0.00 1.12
P12 3.20 2.50 2.92 2.92 5.59 4.03 4.74 5.12 1.12 2.55 1.12 0.00
Step 2: Find the ε-Neighborhood of Each Point (a point is a core point if its neighborhood, including itself, contains at least MinPts = 4 points)
● P1 → P2, P10 → Noise
● P3 → P2, P4 → Noise
● P4 → P3, P7 → Noise
● P5 → P7, P8 → Noise
● P6 → P7 → Noise
● P9 → P12 → Noise
Step 3: Classify Each Point (Core / Border / Noise)
Point Initial Label Final Label
P1 Noise Border
P2 Cluster Cluster
P3 Noise Border
P4 Noise Border
P5 Noise Border
P6 Noise Border
P7 Cluster Cluster
P8 Noise Border
P9 Noise Noise
P10 Noise Border
P11 Cluster Cluster
P12 Noise Border
Step 4: Form Clusters from Core Points and Their Reachable Neighbors
● Cluster 1: {P1, P2, P3, P10, P11, P12} (core points P2 and P11)
● Cluster 2: {P4, P5, P6, P7, P8} (core point P7)
● Noise: {P9}
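The same grouping can be checked with scikit-learn's DBSCAN, using the twelve points and the parameters above (note that scikit-learn's min_samples counts the point itself):

import numpy as np
from sklearn.cluster import DBSCAN

# P1..P12 from the points table
X = np.array([[4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
              [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5]])

db = DBSCAN(eps=1.9, min_samples=4).fit(X)
print(db.labels_)                # label -1 marks noise points
print(db.core_sample_indices_)   # indices of the core points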
Outliers
Definition
● An outlier is a data point that significantly deviates from the overall pattern of the dataset.
● Outliers may represent errors/noise in data or rare but important events (e.g., fraud detection, medical
anomalies).
Types of Outliers
1. Global Outliers (Point Anomalies)
○ A single data point that deviates significantly from the rest of the dataset.
2. Collective Outliers
○ A group of related data points that collectively deviate, though individual points may look normal.
Challenges in Outlier Detection
● Data imbalance: Outliers are usually very few compared to normal data.
● Noise vs. outliers: Difficult to differentiate random noise from meaningful anomalies.
Outlier Detection Methods
1. Supervised Methods
● Train a classifier on data labeled as normal vs. outlier; objects predicted as outliers are flagged.
2. Semi-Supervised Methods
● Train only on normal data (no labeled anomalies).
3. Unsupervised Methods
● No labels; outliers detected based on statistical or structural properties.
● Techniques:
4. Proximity-Based Methods
● Outliers detected by distance or density relative to neighbors.
● Examples:
○ LOF (Local Outlier Factor): Compares local density of a point with its neighbors.
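A minimal LOF sketch with scikit-learn (the toy data and the n_neighbors value are assumptions for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Assumed toy data: a tight clump of points plus one distant point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8], [8.0, 8.0]])

# LOF compares each point's local density with that of its neighbors
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
print(labels)                           # the distant point is flagged
print(-lof.negative_outlier_factor_)    # LOF scores (larger = more outlying)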
5. Clustering-Based Methods
● Outliers are points that do not belong to any cluster or belong to small, sparse clusters.
● Examples: points labelled as noise by a density-based algorithm such as DBSCAN (see the sketch below).
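One simple way to realize this idea is to reuse DBSCAN and treat its noise label as the outlier set; a small sketch on assumed toy data and parameters:

import numpy as np
from sklearn.cluster import DBSCAN

# Assumed toy data: two small clusters plus one isolated point
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.0],
              [9.0, 1.0]])

# Points that belong to no cluster (label -1) are treated as outliers
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(X[labels == -1])   # the isolated point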
Summary Table