Module 4: Clustering

1. Introduction to Clustering
●​ Clustering is a data mining technique used to group a set of objects into clusters such
that objects within the same cluster are more similar to each other than to objects in
other clusters.​

●​ It is an unsupervised learning method (no predefined class labels).​

●​ Objective: Discover natural groupings in data.​

2. Cluster
●​ A cluster is a collection of data objects that are similar to one another within the same
group and dissimilar to objects in other groups.​

●​ Example: Grouping customers based on purchasing behavior.​

3. Cluster Analysis
●​ Cluster Analysis = The process of grouping a set of objects into clusters.​

●​ It includes:​

1.​ Selecting features/attributes​

2.​ Measuring similarity/dissimilarity​

3.​ Choosing clustering algorithms (K-means, Hierarchical, DBSCAN, etc.)​

4.​ Evaluating results​
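A minimal end-to-end sketch of these four steps, assuming scikit-learn and NumPy are available (the feature values and the choice of k below are purely illustrative):

```python
# Steps 1-4 of cluster analysis on toy customer data (illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: selected features (e.g., age and annual spending per customer)
X = np.array([[25, 3000], [27, 3200], [45, 9000],
              [47, 8800], [52, 400], [50, 500]], dtype=float)

# Step 2: similarity is measured as Euclidean distance on scaled features
X_scaled = StandardScaler().fit_transform(X)

# Step 3: choose a clustering algorithm (here K-Means with k = 3)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 4: evaluate the result (silhouette near 1 means well-separated clusters)
print("Cluster labels:", labels)
print("Silhouette score:", silhouette_score(X_scaled, labels))
```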


4. Applications of Cluster Analysis
●​ Marketing: Customer segmentation for targeted advertising.​

●​ Biology: Classification of plants, animals, genes.​

●​ Medicine: Grouping patients with similar symptoms.​

●​ Image Processing: Image segmentation.​

●​ Information Retrieval: Document or web page clustering.​

●​ Social Media: Community detection in networks.​

5. Requirements of Clustering in Data Mining


An effective clustering method should be:

1.​ Scalable – Should handle large datasets.​

2.​ Robust – Handle noise and outliers.​

3.​ Ability to deal with different data types – Numerical, categorical, mixed.​

4.​ Minimal domain knowledge required – Should not rely heavily on user input.​

5.​ Interpretability – Results should be meaningful.​

6. Difference between Classification and Clustering


Aspect          Classification                    Clustering

Learning type   Supervised learning               Unsupervised learning
Class labels    Predefined labels exist           No predefined labels
Goal            Assign objects to known classes   Discover natural groupings
Example         Spam vs. non-spam emails          Grouping emails by topic

7. Types of Data in Cluster Analysis


Data for clustering can be represented in different structures:

a) Data Matrix

●​ Structure: n×p
●​ n = number of objects, p = number of variables/features​

●​ Each row = data object, Each column = attribute.​

●​ Example:​

Object   Age   Salary   Spending Score

O1       25    30K      65
O2       40    70K      40

b) Dissimilarity Matrix

●​ Structure: n×n
●​ Each entry d(i,j) represents the distance or dissimilarity between object i and j.​

●​ Example:

O1 O2 O3

O1 0 5 3

O2 5 0 4

O3 3 4 0

●​ Measures: Euclidean, Manhattan, Cosine similarity, Jaccard, etc.
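A short sketch, assuming NumPy and SciPy, of how an n×p data matrix is converted into an n×n dissimilarity matrix under different measures (the values are illustrative):

```python
# Build an n x n dissimilarity matrix from an n x p data matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# n = 3 objects, p = 2 attributes (illustrative values)
X = np.array([[25.0, 30.0],
              [40.0, 70.0],
              [30.0, 45.0]])

for metric in ("euclidean", "cityblock", "cosine"):   # cityblock = Manhattan
    D = squareform(pdist(X, metric=metric))           # symmetric, zero diagonal
    print(metric, "\n", np.round(D, 2))
```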
Clustering Methods
Clustering methods are broadly divided into categories depending on how they form clusters.

1. Partitioning Methods
●​ Divide the dataset into k clusters (user specifies k in advance).​

●​ Each object belongs to exactly one cluster.​

●​ Aim: Optimize a criterion function (e.g., minimize intra-cluster distance, maximize inter-cluster distance).​

Examples:

●​ K-Means:​

○​ Iteratively assigns objects to the nearest cluster centroid.​

○​ Updates centroids until convergence.​

●​ K-Medoids (PAM):​

○​ Uses actual objects (medoids) as cluster centers instead of means.​

○​ More robust to noise/outliers than K-means.​

2. Hierarchical Methods
●​ Create a hierarchy (tree-like structure) of clusters → Dendrogram.​

●​ Two approaches:​

○​ Agglomerative (Bottom-up): Start with each object as a single cluster, then merge closest pairs step by step.​

○​ Divisive (Top-down): Start with one big cluster and recursively split into smaller
clusters.​
Examples:

●​ AGNES (Agglomerative Nesting)​

●​ DIANA (Divisive Analysis)​

3. Density-Based Methods
●​ Clusters = dense regions of objects separated by low-density regions.​

●​ Good for arbitrary-shaped clusters and handling noise.​

Examples:

●​ DBSCAN (Density-Based Spatial Clustering of Applications with Noise):​

○​ Groups together points that are close in terms of density.​

○​ Identifies core, border, and noise points.​

●​ OPTICS (Ordering Points To Identify Clustering Structure):​

○​ Produces an ordering of points to capture clustering at different density levels.​

4. Grid-Based Methods
●​ Divide data space into a finite number of grid cells.​

●​ Clusters are formed based on dense cells.​

●​ Fast because processing depends on number of cells, not number of data points.​

Examples:

●​ STING (Statistical Information Grid)​


●​ CLIQUE (Clustering in Quest)​
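The following is only a toy illustration of the grid idea (it is not STING or CLIQUE): bin the data space into cells and keep the cells whose point count exceeds a threshold. The grid size and the density threshold are arbitrary assumptions.

```python
# Toy grid-based density illustration: dense grid cells become cluster candidates.
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(2, 0.3, (50, 2)),    # dense region A
                    rng.normal(7, 0.3, (50, 2)),    # dense region B
                    rng.uniform(0, 9, (10, 2))])    # scattered noise

# Divide the 2-D space into a 9 x 9 grid and count points per cell
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=9)
dense_cells = np.argwhere(counts >= 5)              # density threshold = 5 points
print("Dense cells (grid indices):\n", dense_cells)
```

Note that the work is proportional to the number of cells rather than the number of points, which is what makes grid-based methods fast.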

5. Model-Based Methods
●​ Assume data is generated by a model, and clustering tries to best fit the data to that
model.​

●​ Often based on probabilistic models.​


Examples:
●​ EM (Expectation-Maximization): Fits mixture of Gaussian distributions.​

●​ COBWEB: Creates a classification tree.​
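A minimal sketch of model-based clustering, assuming scikit-learn: a two-component Gaussian mixture is fitted by EM, giving both hard labels and soft (probabilistic) memberships. The data and the number of components are illustrative.

```python
# Gaussian mixture fitted by EM; each point gets a probability of membership.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # component 1
               rng.normal(5, 1, (100, 2))])     # component 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)                 # hard cluster assignments
probs = gmm.predict_proba(X[:3])        # soft (probabilistic) memberships
print(labels[:5], "\n", np.round(probs, 3))
```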

6. Constraint-Based Methods
●​ Clustering performed under certain constraints (user-specified conditions like must-link
or cannot-link between objects).​

●​ Useful when domain knowledge is available.​

Comparison Table of Clustering Methods


Method             Examples             Strengths                               Weaknesses

Partitioning       K-means, K-medoids   Simple, efficient                       Need k; sensitive to noise
Hierarchical       AGNES, DIANA         Dendrogram; no need for k               High computational cost
Density-based      DBSCAN, OPTICS       Finds arbitrary shapes, handles noise   Struggles with varying densities
Grid-based         STING, CLIQUE        Fast, scalable                          Depends on grid size
Model-based        EM, COBWEB           Statistical foundation                  Computationally expensive
Constraint-based   COP-Kmeans           Uses domain knowledge                   Complex to implement
K-Means Clustering
1. Introduction
●​ K-Means is a partitioning clustering method.​

●​ It divides the dataset into k clusters, where k is given in advance.​

●​ Each cluster has a centroid (mean point) that represents it.​

●​ Objective: Minimize the sum of squared distances between data points and their
assigned cluster centroid.​

2. Steps of K-Means Algorithm


1.​ Choose k – the number of clusters.​

2.​ Initialize centroids – randomly select k data points as initial centroids.​

3.​ Assignment step: Assign each data point to the nearest centroid (using distance
measure, e.g., Euclidean).​

4.​ Update step: Recalculate centroids as the mean of all points assigned to each cluster.​

5.​ Repeat steps 3–4 until centroids no longer change (convergence).​
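A from-scratch sketch of the five steps above using only NumPy (illustrative data; in practice a library implementation such as scikit-learn's KMeans would normally be used):

```python
# K-Means from scratch, following steps 1-5.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: random init
    for _ in range(max_iter):
        # step 3: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 5: convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
labels, centroids = kmeans(X, k=2)      # step 1: choose k = 2
print(labels, "\n", centroids)
```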

3. Example

4. Distance Measure (Euclidean Distance)

The Euclidean distance between a data point x = (x1, ..., xp) and a centroid c = (c1, ..., cp) is d(x, c) = √((x1 - c1)² + ... + (xp - cp)²). It is used to assign each data point to its nearest centroid.
5. Advantages of K-Means
●​ Simple and easy to implement.​

●​ Works well with large datasets.​

●​ Converges quickly.​

6. Disadvantages of K-Means
●​ Must specify k in advance.​

●​ Sensitive to initial centroid selection.​

●​ Works best with spherical, evenly sized clusters.​

●​ Poor with noise and outliers.​

7. Applications of K-Means
●​ Market segmentation (group customers).​

●​ Document or text clustering.​

●​ Image compression (by reducing colors).​

●​ Pattern recognition.​
K-Medoids Clustering
●​ K-Medoids is a partitioning clustering algorithm similar to K-Means, but instead of
using the mean of data points as a cluster center, it uses medoids (actual data points).​

●​ A medoid is the most centrally located data point in a cluster (i.e., the one with minimum
dissimilarity to all others in that cluster).​

This makes K-Medoids more robust to noise and outliers compared to K-Means.

Key Characteristics:
1.​ Input: Number of clusters k.​

2.​ Cluster center: Medoid (actual object) instead of mean.​

3.​ Distance measure: Usually Euclidean distance, but can use any dissimilarity measure.​

4.​ Output: Partition of the dataset into k clusters + set of k medoids.​

Algorithm (PAM – Partitioning Around Medoids):


1.​ Initialize: Randomly choose k medoids from the dataset.​

2.​ Assignment step: Assign each data point to the nearest medoid.​

3.​ Update step: For each cluster, find the most centrally located object (the medoid).​

4.​ Swap step: Try replacing a medoid with a non-medoid point and check if total cost (sum
of distances of points to their medoid) decreases.​

5.​ Repeat until no improvement.
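A compact sketch of the PAM idea above, using only NumPy: assign points to the nearest medoid and greedily accept any medoid/non-medoid swap that lowers the total cost. The data are illustrative and the implementation is naive; real implementations are more efficient.

```python
# Naive PAM (Partitioning Around Medoids) sketch.
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)         # pairwise distances
    medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))

    def cost(meds):
        # total distance of every point to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                                             # repeat until no improvement
        improved = False
        for m, o in product(list(medoids), range(n)):           # try every (medoid, non-medoid) swap
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if cost(candidate) < cost(medoids):                 # keep the swap if cost decreases
                medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)                       # final assignment step
    return labels, medoids

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [100, 100]])
print(pam(X, k=2))
```

Because medoids are actual data points, an extreme point like (100, 100) cannot drag a cluster center toward it the way it would drag a mean-based centroid.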


Hierarchical Clustering
●​ Hierarchical clustering builds a hierarchy of clusters.​

●​ Two approaches:​

1.​ Agglomerative (Bottom–Up) → Start with each data point as its own cluster,
then merge them step by step.​

2.​ Divisive (Top–Down) → Start with all data points in one cluster, then split them.​

Agglomerative Hierarchical Clustering (AHC)


●​ Start: Each object is its own cluster.​

●​ Iteratively: Merge the two closest clusters based on a distance metric.​

●​ Stop: When only one cluster remains (or desired number of clusters is reached).​

📌 Algorithm Steps
1.​ Initialize: Treat each data point as a cluster.​

2.​ Compute Distance Matrix: Calculate pairwise distances between all clusters.​

3.​ Find Closest Clusters: Identify two clusters with minimum distance.​

4.​ Merge: Merge them into a single cluster.​

5.​ Update Distance Matrix: Recalculate distances between new cluster and remaining
clusters.​

6.​ Repeat: Continue until only one cluster remains or until a stopping condition (e.g., k
clusters).​

7.​ Dendrogram: Final result is usually visualized as a tree-like diagram.​


Linkage Criteria (How to define distance between clusters?)
1.​ Single Linkage (MIN) → distance = shortest distance between any two points in
different clusters.​

2.​ Complete Linkage (MAX) → distance = farthest distance between any two points in
different clusters.​

3.​ Average Linkage → distance = average of all pairwise distances.
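A minimal sketch, assuming SciPy, showing agglomerative clustering with the three linkage criteria above on small illustrative data:

```python
# Agglomerative clustering with single, complete, and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [5.5, 5], [9, 9]], dtype=float)

for method in ("single", "complete", "average"):     # MIN, MAX, average linkage
    Z = linkage(X, method=method)                    # merge history (the dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)

# dendrogram(Z) can be drawn with matplotlib to visualize the tree-like diagram.
```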


DBSCAN Clustering Algorithm

DBSCAN is a density-based clustering algorithm that groups together points that are closely
packed while marking points that are far away as outliers (noise).

How DBSCAN Works

●​ Defining Density (ε and MinPts)​

●​ Identifying Core, Border, and Noise Points​

Points Table:

Point X Y

P1 4.5 8

P2 5 7

P3 6 6.5

P4 7 5

P5 9 4

P6 7 3

P7 8 3.5

P8 9 5

P9 4 4

P10 3 7.5

P11 4 6

P12 3.5 5

Parameters:

●​ ε (epsilon) = 1.9​

●​ MinPts = 4
Step 1: Distance Matrix (Euclidean distances, rounded to two decimals)

      P1    P2    P3    P4    P5    P6    P7    P8    P9    P10   P11   P12
P1    0.00  1.12  2.12  3.91  6.02  5.59  5.70  5.41  4.03  1.58  2.06  3.16
P2    1.12  0.00  1.12  2.83  5.00  4.47  4.61  4.47  3.16  2.06  1.41  2.50
P3    2.12  1.12  0.00  1.80  3.91  3.64  3.61  3.35  3.20  3.16  2.06  2.92
P4    3.91  2.83  1.80  0.00  2.24  2.00  1.80  2.00  3.16  4.72  3.16  3.50
P5    6.02  5.00  3.91  2.24  0.00  2.24  1.12  1.00  5.00  6.95  5.39  5.59
P6    5.59  4.47  3.64  2.00  2.24  0.00  1.12  2.83  3.16  6.02  4.24  4.03
P7    5.70  4.61  3.61  1.80  1.12  1.12  0.00  1.80  4.03  6.40  4.72  4.74
P8    5.41  4.47  3.35  2.00  1.00  2.83  1.80  0.00  5.10  6.50  5.10  5.50
P9    4.03  3.16  3.20  3.16  5.00  3.16  4.03  5.10  0.00  3.64  2.00  1.12
P10   1.58  2.06  3.16  4.72  6.95  6.02  6.40  6.50  3.64  0.00  1.80  2.55
P11   2.06  1.41  2.06  3.16  5.39  4.24  4.72  5.10  2.00  1.80  0.00  1.12
P12   3.16  2.50  2.92  3.50  5.59  4.03  4.74  5.50  1.12  2.55  1.12  0.00

Step 2: Neighborhood Identification (ε = 1.9)

A point is a core point if its ε-neighborhood (the point itself plus all points within distance ε) contains at least MinPts = 4 points.

●​ P1 → P2, P10 → not core

●​ P2 → P1, P3, P11 → core

●​ P3 → P2, P4 → not core

●​ P4 → P3, P7 → not core

●​ P5 → P7, P8 → not core

●​ P6 → P7 → not core

●​ P7 → P4, P5, P6, P8 → core

●​ P8 → P5, P7 → not core

●​ P9 → P12 → not core

●​ P10 → P1, P11 → not core

●​ P11 → P2, P10, P12 → core

●​ P12 → P9, P11 → not core

Step 3: Identify Core, Border, Noise Points


Point   Initial Status   Final Status

P1      Noise            Border
P2      Core             Core
P3      Noise            Border
P4      Noise            Border
P5      Noise            Border
P6      Noise            Border
P7      Core             Core
P8      Noise            Border
P9      Noise            Noise
P10     Noise            Border
P11     Core             Core
P12     Noise            Border

(Non-core points that lie within ε of a core point become border points; P9 stays noise because its only neighbor, P12, is not a core point.)


Step 4: Final Clusters

●​ Cluster C1: {P1, P2, P3, P10, P11, P12} (core points P2 and P11 lie within ε of each other, d = 1.41, so their neighborhoods merge into a single cluster).​

●​ Cluster C2: {P4, P5, P6, P7, P8} (built around core point P7).​

●​ Noise: {P9}​
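As a cross-check of this worked example, the same points can be run through scikit-learn's DBSCAN (assuming it is installed). With eps = 1.9 and min_samples = 4 (the point itself is counted), it should report two clusters and mark P9 as noise (label -1):

```python
# Verify the DBSCAN worked example with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
                   [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5]])

labels = DBSCAN(eps=1.9, min_samples=4).fit_predict(points)
for name, label in zip([f"P{i}" for i in range(1, 13)], labels):
    print(name, label)          # -1 means noise
```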
Outliers
Definition
●​ An outlier is a data point that significantly deviates from the overall pattern of the dataset.​

●​ Outliers may represent errors/noise in data or rare but important events (e.g., fraud detection, medical
anomalies).​

Types of Outliers
1.​ Global Outliers (Point Anomalies)​

○​ Data instances far from the rest of the data.​

○​ Example: Fraudulent credit card transactions.​

2.​ Contextual Outliers (Conditional Anomalies)​

○​ Outliers only in a specific context.​

○​ Example: Temperature of 30°C is normal in summer but abnormal in winter.​

3.​ Collective Outliers​

○​ A group of related data points collectively deviate, though individual points may look normal.​

○​ Example: Sudden drop in stock prices across a sector.​


Challenges in Outlier Detection
●​ High-dimensional data makes distance and density measures less meaningful.​

●​ Distinguishing between true anomalies and rare but valid patterns.​

●​ Data imbalance: Outliers are usually very few compared to normal data.​

●​ Noise vs. outliers: Difficult to differentiate random noise from meaningful anomalies.​

●​ Scalability: Large datasets require efficient algorithms.​

Outlier Detection Methods


1. Supervised Methods
●​ Assume labels for normal vs. abnormal data are available.​

●​ Outlier detection = classification problem.​

●​ Techniques: Decision Trees, SVM, Neural Networks.​

●​ Challenge: Needs labeled outliers (rare in real life).​

2. Semi-Supervised Methods
●​ Train only on normal data (no labeled anomalies).​

●​ Model learns normal patterns; deviations are flagged as outliers.​

●​ Techniques: One-Class SVM, Autoencoders, Isolation Forest.​
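A minimal semi-supervised sketch, assuming scikit-learn: a One-Class SVM is trained on normal data only, and deviations in new data are flagged as outliers. The data and the nu parameter are illustrative.

```python
# Train on normal data only; flag deviations in new data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, (200, 2))            # training data: normal behavior only

model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_normal)

X_new = np.array([[0.1, -0.2], [0.5, 0.3], [6.0, 6.0]])   # last point is far from normal
print(model.predict(X_new))                      # +1 = normal, -1 = outlier
```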

3. Unsupervised Methods
●​ No labels; outliers detected based on statistical or structural properties.​

●​ Techniques:​

○​ Statistical tests (Z-score, Grubbs’ test, Boxplot/IQR method).​


○​ Density-based (LOF – Local Outlier Factor, DBSCAN).​

○​ Proximity-based (distance-based methods).​
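A small sketch of two of the statistical tests listed above, using only NumPy; the data are synthetic and the thresholds (3 standard deviations, 1.5×IQR) are the usual conventions:

```python
# Z-score and Boxplot/IQR outlier tests on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(11, 1, 50), 95.0)     # 50 "normal" values plus one extreme value

# Z-score method: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[np.abs(z) > 3])

# Boxplot / IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```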

4. Proximity-Based Methods
●​ Outliers detected by distance or density relative to neighbors.​

●​ Examples:​

○​ k-NN distance: If a point’s distance to its k nearest neighbors is large → outlier.​

○​ LOF (Local Outlier Factor): Compares local density of a point with its neighbors.​
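A minimal sketch of the proximity-based idea with LOF, assuming scikit-learn; the data and the n_neighbors value are illustrative:

```python
# Local Outlier Factor: low local density relative to neighbors => outlier.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # one dense region
               [[5.0, 5.0]]])                  # an isolated point

lof = LocalOutlierFactor(n_neighbors=10)       # compares each point's density to its k neighbors
labels = lof.fit_predict(X)                    # +1 = inlier, -1 = outlier
print("Outlier indices:", np.where(labels == -1)[0])
print("LOF scores (more negative = more anomalous):", lof.negative_outlier_factor_[-3:])
```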

5. Clustering-Based Methods
●​ Outliers are points that do not belong to any cluster or belong to small, sparse clusters.​

●​ Examples:​

○​ k-Means: Outliers are far from any cluster centroid.​

○​ DBSCAN: Points classified as Noise are outliers.​

○​ Hierarchical Clustering: Points that don’t merge well with clusters.​
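A sketch of the clustering-based idea with k-means, assuming scikit-learn: points unusually far from their nearest centroid are treated as outliers. The 95th-percentile cut-off is an arbitrary illustrative choice.

```python
# Flag points far from their assigned k-means centroid as outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(8, 1, (100, 2)),
               [[20.0, 20.0]]])                       # a point far from both groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)  # distance to own centroid
threshold = np.percentile(dists, 95)                  # illustrative cut-off
print("Outlier indices:", np.where(dists > threshold)[0])
```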

Summary Table

Method Type Example Techniques Label Requirement

Supervised SVM, Decision Trees, Neural Nets Labeled normal + outlier

Semi-supervised One-Class SVM, Autoencoders Only normal data

Unsupervised Z-score, LOF, Isolation Forest No labels

Proximity-based k-NN distance, LOF No labels

Clustering-based k-Means, DBSCAN, Hierarchical No labels
