Workshop Project Report
Year of Submission: 2023-24
Submitted by:
Divyanshu Khandelwal, 2115500055, 3S, Class Roll No. 22
Suryansh Agrawal, 2115500147, 3S, Class Roll No. 42
Sonal Mittal, 2115500140, 3S, Class Roll No. 40
Anshika Singh, 2115500024, 3S, Class Roll No. 10
Department of Computer Engineering and Applications
GLA University, Mathura
Project Report: Customer Segmentation through Clustering
Analysis
Introduction:
Customer segmentation is a crucial aspect of marketing
strategies. Clustering algorithms aid in identifying patterns
within data to categorize customers into groups with similar
traits. This project utilizes two clustering algorithms—DBSCAN
and K-Means—to segment customers based on their
purchasing behavior.
Dataset:
The dataset used in this project contains transactional records
from a retail store. It includes attributes such as customer ID,
purchase history, frequency of purchases, and total amount
spent.
Methodology:
Data Preprocessing
1. Data Cleaning: Removing duplicates, handling missing
values, and ensuring data consistency.
2. Feature Selection: Choosing relevant attributes for
clustering, such as purchase frequency and total
spending.
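3. Feature Scaling: Normalizing numerical features to a common
scale so that attributes with larger ranges, such as total
spending, do not dominate the distance calculations.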
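The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (customer_id, purchase_frequency, total_spent), the toy records, the median fill for missing values, and the choice of StandardScaler are all assumptions, since the report does not specify them.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy records; the real dataset's columns are not listed in the report.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "purchase_frequency": [12, 3, 3, 25, None, 7],
    "total_spent": [540.0, 80.0, 80.0, 1900.0, 310.0, None],
})

def preprocess(df):
    """Clean the records, select the clustering features, and scale them."""
    features = ["purchase_frequency", "total_spent"]
    # 1. Data cleaning: drop exact duplicate rows, then fill missing values
    #    with the column median (an assumed strategy).
    cleaned = df.drop_duplicates().copy()
    cleaned[features] = cleaned[features].fillna(cleaned[features].median())
    # 2. Feature selection: keep only the attributes used for clustering.
    selected = cleaned[features]
    # 3. Feature scaling: zero mean and unit variance per column.
    scaled = StandardScaler().fit_transform(selected)
    return cleaned, scaled

cleaned, X_scaled = preprocess(raw)
print("rows after cleaning:", len(cleaned))
print("scaled feature matrix shape:", X_scaled.shape)
```

The scaled array X_scaled is what a distance-based algorithm such as DBSCAN or K-Means would receive as input.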
Clustering Algorithms
1. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
- DBSCAN identifies clusters based on density. It groups
together points that are closely packed and labels points in
low-density regions as noise.
- Parameters: Epsilon (ε) and Minimum Points (MinPts).
- Advantages: Robust to outliers and doesn’t require
specifying the number of clusters.
- Implementation: Using scikit-learn's DBSCAN algorithm.
2. K-Means Clustering
- K-Means partitions data into K clusters by assigning each
point to its nearest centroid.
- Parameters: Number of clusters (K).
- Advantages: Simple, scalable, and efficient for large
datasets.
- Implementation: Utilizing scikit-learn's KMeans algorithm.
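As a rough illustration of how the two algorithms are called in scikit-learn, the sketch below runs both on synthetic data. The data (two dense groups plus one distant point) and the parameter values eps=0.5, min_samples=5, and n_clusters=2 are assumptions chosen for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for scaled customer features:
# two dense groups of 30 points each, plus one far-away outlier.
group_a = rng.normal(loc=[-1.0, -1.0], scale=0.1, size=(30, 2))
group_b = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(30, 2))
outlier = np.array([[4.0, 4.0]])
X = np.vstack([group_a, group_b, outlier])

# DBSCAN: eps is the neighbourhood radius (epsilon), min_samples is MinPts.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# K-Means: n_clusters is K; n_init and random_state are fixed for reproducibility.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_labels = km.labels_

print("DBSCAN labels found:", sorted(set(db_labels)))  # -1 marks noise points
print("DBSCAN label of the outlier:", db_labels[-1])
print("K-Means label of the outlier:", km_labels[-1])
print("K-Means centroids:")
print(km.cluster_centers_)
```

In this setup DBSCAN marks the distant point as noise, while K-Means has no noise label and must assign it to one of the K clusters, which illustrates the robustness to outliers listed among DBSCAN's advantages.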
Model Building and Evaluation
DBSCAN Model
- Identified clusters based on varying epsilon values and
minimum points.
- Evaluated silhouette scores and visualized clusters using
scatter plots.
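A parameter sweep of the kind described above might look like the sketch below. The synthetic data, the grid of eps and min_samples values, and the helper name sweep_dbscan are hypothetical. Computing the silhouette score on non-noise points only is also an assumption, since the report does not say how noise points were treated during evaluation.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X_raw)

def sweep_dbscan(X, eps_values, min_samples_values):
    """Return (eps, min_samples, n_clusters, n_noise, silhouette) per setting."""
    results = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            mask = labels != -1  # True for points assigned to a cluster
            n_clusters = len(set(labels[mask].tolist()))
            # The silhouette score is only defined for at least 2 clusters
            # and for fewer clusters than points; otherwise record NaN.
            score = float("nan")
            if n_clusters >= 2 and mask.sum() > n_clusters:
                score = float(silhouette_score(X[mask], labels[mask]))
            results.append((eps, min_samples, n_clusters, int((~mask).sum()), score))
    return results

for row in sweep_dbscan(X, eps_values=[0.1, 0.2, 0.3, 0.5], min_samples_values=[5, 10]):
    print("eps=%.1f min_samples=%2d clusters=%d noise=%3d silhouette=%.3f" % row)
```

Reading the noise count alongside the silhouette score helps avoid settings that score well only because most points were discarded as noise.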
K-Means Model
- Explored different K values to find the optimal number of
clusters.
- Assessed the inertia scores and visualized clusters using
scatter plots.
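One common way to explore K is to fit K-Means over a range of values and compare inertia (the "elbow" method) together with the silhouette score. The sketch below assumes that approach; the report does not state how K was selected, and the synthetic data, the range 2 to 8, and the helper name scan_k are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=7)

def scan_k(X, k_values, random_state=0):
    """Fit K-Means for each K and collect (K, inertia, silhouette)."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        rows.append((k, float(model.inertia_), float(silhouette_score(X, model.labels_))))
    return rows

rows = scan_k(X, range(2, 9))
for k, inertia, sil in rows:
    print(f"K={k}  inertia={inertia:10.2f}  silhouette={sil:.3f}")

# Inertia keeps shrinking as K grows, so it is read for an "elbow";
# the silhouette score can be maximised directly.
best_k = max(rows, key=lambda r: r[2])[0]
print("K with the highest silhouette:", best_k)
```

Because inertia alone favours ever larger K, pairing it with the silhouette score gives a second, independent signal for the choice.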
Comparative Study
Performance Metrics
- Silhouette Score: Measures how compact each cluster is and
how well separated it is from the other clusters. Scores range
from -1 to 1, and higher scores indicate better-defined
clusters.
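- Inertia: The sum of squared distances from each point to its
assigned cluster centroid, a measure of how internally coherent
clusters are. Lower values represent tighter clusters, but
inertia tends to fall as K increases, so it is best compared
across K values rather than read in isolation.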
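To make the two metrics concrete, the sketch below computes them by hand on a six-point toy dataset and compares the results with scikit-learn's own values. The data and the helper names inertia and silhouette_of_point are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

def inertia(X, labels, centers):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centers[labels]) ** 2).sum())

def silhouette_of_point(X, labels, i):
    """s(i) = (b - a) / max(a, b), using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    if not same.any():
        return 0.0  # convention for single-point clusters
    a = dists[same].mean()  # mean distance to the point's own cluster
    # b: mean distance to the nearest other cluster
    b = min(dists[labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
    return float((b - a) / max(a, b))

manual_inertia = inertia(X, labels, km.cluster_centers_)
manual_silhouette = float(np.mean([silhouette_of_point(X, labels, i) for i in range(len(X))]))

print("inertia:    manual =", manual_inertia, " sklearn =", km.inertia_)
print("silhouette: manual =", manual_silhouette, " sklearn =", silhouette_score(X, labels))
```

The overall silhouette score is the mean of the per-point values, which is why a few poorly placed points can pull the score down even when most clusters are well formed.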
Results and Observations
- DBSCAN: Showed varying performance with different
parameter settings. Achieved a silhouette score of X.
- K-Means: Found an optimal number of clusters (K) with a
silhouette score of Y and an inertia value of Z.
Conclusion
- Both algorithms effectively segmented customers based on
purchasing behavior.
- DBSCAN proved robust to outliers but required careful
parameter tuning.
- K-Means demonstrated simplicity and scalability, providing
well-defined clusters once a suitable value of K was chosen.
Recommendations
- For datasets with clearly dense clusters and noticeable
outliers, DBSCAN can be a suitable choice.
- In scenarios where scalability and simplicity are vital,
K-Means can be preferred.
Future Work
- Experiment with other clustering algorithms like Hierarchical
Clustering or Gaussian Mixture Models.
- Incorporate additional features or external data sources for
more robust segmentation.
---
This report provides an overview of customer segmentation
using DBSCAN and K-Means algorithms, highlighting their
strengths, weaknesses, and comparative performance.