Module 4
1. Introduction to Clustering
● Clustering is a data mining technique used to group a set of objects into clusters such
that objects within the same cluster are more similar to each other than to objects in
other clusters.
2. Cluster
● A cluster is a collection of data objects that are similar to one another within the same
group and dissimilar to objects in other groups.
3. Cluster Analysis
● Cluster Analysis = The process of grouping a set of objects into clusters.
● Typical requirements of a clustering algorithm include (among others):
○ Ability to deal with different data types – numerical, categorical, mixed.
○ Minimal domain knowledge required – should not rely heavily on user input.
Data Structures Used in Cluster Analysis
a) Data Matrix
● Structure: n×p
● n = number of objects, p = number of variables/features
● Example (2 objects, 3 attributes):
Object Attribute 1 Attribute 2 Attribute 3
O1 25 30K 65
O2 40 70K 40
b) Dissimilarity Matrix
● Structure: n×n
● Each entry d(i,j) represents the distance or dissimilarity between object i and j.
● Example:
O1 O2 O3
O1 0 5 3
O2 5 0 4
O3 3 4 0
● Measures: Euclidean, Manhattan, Cosine similarity, Jaccard, etc.
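As an illustration of how a dissimilarity matrix is computed from a data matrix, here is a minimal Python sketch (the three objects, two numeric attributes, and the Euclidean measure are assumptions chosen only for illustration):

import numpy as np
from scipy.spatial.distance import cdist

# Data matrix: n = 3 objects (rows) x p = 2 numeric attributes (columns)
X = np.array([[1.0, 2.0],
              [4.0, 6.0],
              [1.0, 5.0]])

# Dissimilarity matrix: n x n, entry d(i, j) = Euclidean distance between objects i and j
D = cdist(X, X, metric='euclidean')
print(np.round(D, 2))   # diagonal is 0 and the matrix is symmetric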
Clustering Methods
Clustering methods are broadly divided into categories depending on how they form clusters.
1. Partitioning Methods
● Divide the dataset into k clusters (user specifies k in advance).
Examples:
● K-Means: each cluster is represented by the mean (centroid) of its points.
● K-Medoids (PAM): each cluster is represented by an actual data point (the medoid).
2. Hierarchical Methods
● Create a hierarchy (tree-like structure) of clusters → Dendrogram.
● Two approaches:
○ Agglomerative (Bottom-up): Start with each data point as its own cluster and merge the closest clusters step by step.
○ Divisive (Top-down): Start with one big cluster and recursively split into smaller clusters.
Examples:
3. Density-Based Methods
● Clusters = dense regions of objects separated by low-density regions.
Examples: DBSCAN (covered in detail below).
4. Grid-Based Methods
● Divide data space into a finite number of grid cells.
● Fast because processing depends on number of cells, not number of data points.
Examples:
5. Model-Based Methods
● Assume data is generated by a model, and clustering tries to best fit the data to that
model.
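One widely used model-based approach is to assume the data comes from a mixture of Gaussians and fit it with the EM algorithm; the sketch below uses scikit-learn's GaussianMixture on an assumed toy dataset (the data and the choice of 2 components are for illustration only):

import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed toy dataset: two clumps of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# Fit a 2-component Gaussian mixture; each point is assigned to the
# component (model) that best explains it
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)   # cluster label for each point
print(gmm.means_)         # estimated component means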
6. Constraint-Based Methods
● Clustering performed under certain constraints (user-specified conditions like must-link
or cannot-link between objects).
K-Means Clustering
1. Introduction
● K-Means is a partitioning algorithm that divides the dataset into k clusters, each represented by the mean (centroid) of its points.
● Objective: Minimize the sum of squared distances between data points and their assigned cluster centroid.
2. Algorithm Steps
1. Choose the number of clusters k.
2. Initialization: Select k initial centroids (e.g., k random data points).
3. Assignment step: Assign each data point to the nearest centroid (using a distance measure, e.g., Euclidean).
4. Update step: Recalculate centroids as the mean of all points assigned to each cluster.
5. Repeat steps 3 and 4 until the centroids no longer change (convergence).
3. Example
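As a small example, the K-Means loop can be run with scikit-learn; the toy 2-D dataset and the choice k = 2 below are assumptions made only to illustrate the assignment/update steps:

import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: six 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# k must be specified in advance; KMeans repeats the assignment and
# update steps internally until the centroids stop changing
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # final centroids (mean of each cluster)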
5. Advantages of K-Means
● Converges quickly.
6. Disadvantages of K-Means
● Must specify k in advance.
7. Applications of K-Means
● Market segmentation (group customers).
● Pattern recognition.
K-Medoids Clustering
● K-Medoids is a partitioning clustering algorithm similar to K-Means, but instead of
using the mean of data points as a cluster center, it uses medoids (actual data points).
● A medoid is the most centrally located data point in a cluster (i.e., the one with minimum
dissimilarity to all others in that cluster).
This makes K-Medoids more robust to noise and outliers compared to K-Means.
Key Characteristics:
1. Input: Number of clusters k.
3. Distance measure: Usually Euclidean distance, but can use any dissimilarity measure.
4. Output: Partition of the dataset into k clusters + a set of k medoids.
Algorithm Steps (PAM):
1. Initialization: Randomly select k data points as the initial medoids.
2. Assignment step: Assign each data point to the nearest medoid.
3. Update step: For each cluster, find the most centrally located object (the medoid).
4. Swap step: Try replacing a medoid with a non-medoid point and check if total cost (sum
of distances of points to their medoid) decreases.
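A rough NumPy sketch of the assign/update idea (a simplified k-medoids iteration on an assumed toy dataset, not the full PAM swap search):

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)         # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assignment step
        new_medoids = medoids.copy()
        for c in range(k):                                      # update step: within each
            members = np.where(labels == c)[0]                  # cluster, pick the point with
            costs = D[np.ix_(members, members)].sum(axis=1)     # minimum total distance
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):                # stop when medoids stabilize
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids], labels)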
Hierarchical Clustering
● Two approaches:
1. Agglomerative (Bottom–Up) → Start with each data point as its own cluster,
then merge them step by step.
2. Divisive (Top–Down) → Start with all data points in one cluster, then split them.
● Stop (agglomerative): When only one cluster remains (or the desired number of clusters is reached).
📌 Algorithm Steps
1. Initialize: Treat each data point as a cluster.
2. Compute Distance Matrix: Calculate pairwise distances between all clusters.
3. Find Closest Clusters: Identify two clusters with minimum distance.
4. Merge: Combine the two closest clusters into a single cluster.
5. Update Distance Matrix: Recalculate distances between the new cluster and the remaining clusters.
6. Repeat: Continue until only one cluster remains or until a stopping condition (e.g., k
clusters).
Linkage Criteria (how the distance between two clusters is measured)
1. Single Linkage (MIN) → distance = shortest distance between any two points in different clusters.
2. Complete Linkage (MAX) → distance = farthest distance between any two points in different clusters.
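A short sketch of agglomerative clustering with SciPy (the toy points, the complete-linkage choice, and cutting the dendrogram at k = 2 are assumptions for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed toy data: six 2-D points
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)

# Build the dendrogram bottom-up using complete linkage (MAX distance)
Z = linkage(X, method='complete', metric='euclidean')

# Cut the tree so that 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g., [1 1 1 2 2 2]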
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups together points that are closely packed, while marking points that lie in low-density regions as outliers (noise).
Points Table:
Point X Y
P1 4.5 8
P2 5 7
P3 6 6.5
P4 7 5
P5 9 4
P6 7 3
P7 8 3.5
P8 9 5
P9 4 4
P10 3 7.5
P11 4 6
P12 3.5 5
Parameters:
● ε (epsilon) = 1.9
● MinPts = 4
Step 1: Distance Matrix
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12
P1 0.00 1.12 2.12 3.91 6.02 5.59 5.70 5.41 4.03 1.58 2.06 3.20
P2 1.12 0.00 1.12 2.83 5.00 4.47 4.61 4.47 3.16 2.06 1.41 2.50
P3 2.12 1.12 0.00 1.80 3.91 3.64 3.61 3.35 3.20 3.16 2.06 2.92
P4 3.91 2.83 1.80 0.00 2.24 2.00 1.80 2.00 3.16 4.72 3.16 2.92
P5 6.02 5.00 3.91 2.24 0.00 2.24 1.12 1.00 5.00 6.95 5.39 5.59
P6 5.59 4.47 3.64 2.00 2.24 0.00 1.12 2.83 3.16 6.02 4.24 4.03
P7 5.70 4.61 3.61 1.80 1.12 1.12 0.00 1.80 4.03 6.50 4.72 4.74
P8 5.41 4.47 3.35 2.00 1.00 2.83 1.80 0.00 5.10 6.30 5.10 5.12
P9 4.03 3.16 3.20 3.16 5.00 3.16 4.03 5.10 0.00 3.64 2.00 1.12
P10 1.58 2.06 3.16 4.72 6.95 6.02 6.50 6.30 3.64 0.00 1.80 2.55
P11 2.06 1.41 2.06 3.16 5.39 4.24 4.72 5.10 2.00 1.80 0.00 1.12
P12 3.20 2.50 2.92 2.92 5.59 4.03 4.74 5.12 1.12 2.55 1.12 0.00
Step 2: Find the ε-Neighborhood of Each Point (a point is a core point if its neighborhood, including itself, contains at least MinPts = 4 points)
● P1 → P2, P10 → Noise
● P3 → P2, P4 → Noise
● P4 → P3, P7 → Noise
● P5 → P7, P8 → Noise
● P6 → P7 → Noise
● P9 → P12 → Noise
Step 3: Classify Each Point (Core / Border / Noise)
Point Initial Label Final Label
P1 Noise Border
P2 Cluster Cluster
P3 Noise Border
P4 Noise Border
P5 Noise Border
P6 Noise Border
P7 Cluster Cluster
P8 Noise Border
P9 Noise Noise
P10 Noise Border
P11 Cluster Cluster
P12 Noise Border
Step 4: Form Clusters from Core Points and Their Reachable Neighbors
● Cluster 1: {P1, P2, P3, P10, P11, P12} (core points P2 and P11)
● Cluster 2: {P4, P5, P6, P7, P8} (core point P7)
● Noise: {P9}
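The same grouping can be checked with scikit-learn's DBSCAN, using the twelve points and the parameters above (note that scikit-learn's min_samples counts the point itself):

import numpy as np
from sklearn.cluster import DBSCAN

# P1..P12 from the points table
X = np.array([[4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
              [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5]])

db = DBSCAN(eps=1.9, min_samples=4).fit(X)
print(db.labels_)                # label -1 marks noise points
print(db.core_sample_indices_)   # indices of the core points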
Outliers
Definition
● An outlier is a data point that significantly deviates from the overall pattern of the dataset.
● Outliers may represent errors/noise in data or rare but important events (e.g., fraud detection, medical
anomalies).
Types of Outliers
1. Global Outliers (Point Anomalies)
○ A single data point that deviates significantly from the rest of the dataset.
2. Collective Outliers
○ A group of related data points that collectively deviate, though individual points may look normal.
Challenges in Outlier Detection
● Data imbalance: Outliers are usually very few compared to normal data.
● Noise vs. outliers: Difficult to differentiate random noise from meaningful anomalies.
Outlier Detection Methods
1. Supervised Methods
● Train a classifier on data labeled as normal vs. outlier; objects predicted as outliers are flagged.
2. Semi-Supervised Methods
● Train only on normal data (no labeled anomalies).
3. Unsupervised Methods
● No labels; outliers detected based on statistical or structural properties.
● Techniques:
4. Proximity-Based Methods
● Outliers detected by distance or density relative to neighbors.
● Examples:
○ LOF (Local Outlier Factor): Compares local density of a point with its neighbors.
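A minimal LOF sketch with scikit-learn (the toy data and the n_neighbors value are assumptions for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Assumed toy data: a tight clump of points plus one distant point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.8], [8.0, 8.0]])

# LOF compares each point's local density with that of its neighbors
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
print(labels)                           # the distant point is flagged
print(-lof.negative_outlier_factor_)    # LOF scores (larger = more outlying)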
5. Clustering-Based Methods
● Outliers are points that do not belong to any cluster or belong to small, sparse clusters.
● Examples: points labelled as noise by a density-based algorithm such as DBSCAN (see the sketch below).
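One simple way to realize this idea is to reuse DBSCAN and treat its noise label as the outlier set; a small sketch on assumed toy data and parameters:

import numpy as np
from sklearn.cluster import DBSCAN

# Assumed toy data: two small clusters plus one isolated point
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.0],
              [9.0, 1.0]])

# Points that belong to no cluster (label -1) are treated as outliers
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(X[labels == -1])   # the isolated point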
Summary Table