Fyp CS 2023 TWX
A REPORT
SUBMITTED TO
Universiti Tunku Abdul Rahman
in partial fulfillment of the requirements
for the degree of
BACHELOR OF COMPUTER SCIENCE (HONOURS)
Faculty of Information and Communication Technology
(Kampar Campus)
JAN 2023
UNIVERSITI TUNKU ABDUL RAHMAN
Verified by,
_________________________ _________________________
(Author’s signature) (Supervisor’s signature)
Address:
7, Laluan Keledang Utara 2,
Taman Wang, _______Lim Jia Qi_________
31450 Menglembu, Perak Supervisor’s name
Date: __1/April/2023____
It is hereby certified that Toh Wei Xuan (ID No: 19ACB03568) has completed this final year
project/ dissertation/ thesis* entitled “_Customer Segmentation on Clustering Algorithms_” under
the supervision of _Dr Lim Jia Qi_ (Supervisor) from the Department of Computer Science, Faculty
of Information and Communication Technology, and Dr Kh'ng Xin Yi (Co-Supervisor)* from the
Department of Computer Science, Faculty of Information and Communication Technology.
I understand that the University will upload a softcopy of my final year project / dissertation /
thesis* in PDF format into the UTAR Institutional Repository, which may be made accessible
to the UTAR community and the public.
Yours truly,
____________________
Toh Wei Xuan
Signature : _________________________
Their continuous support and advice have been instrumental in overcoming the challenges I
faced during this project. Their unwavering belief in my abilities and willingness to share
their knowledge and expertise have been invaluable in shaping my understanding of the
subject matter.
Once again, I would like to express my heartfelt thanks to my supervisor and moderator for
the golden opportunity they have given me, and for the immense impact they have had on my
academic and professional growth. I will forever cherish the experience and knowledge
gained under their mentorship.
Firstly, descriptive analysis is performed to explore the characteristics of the dataset. Then, k-
means, DBSCAN, and GMM clustering algorithms are applied to segment customers based
on their buying behaviour. Finally, RFM (Recency, Frequency, Monetary) analysis is used to
segment customers based on their purchasing history.
The results show that all clustering algorithms were able to identify distinct customer groups
with varying characteristics. Furthermore, the RFM analysis was able to segment customers
based on their buying patterns and provide insights into their behaviour.
Overall, the study demonstrates the effectiveness of different clustering algorithms and RFM
analysis in identifying customer segments. The insights gained from this study could
potentially be used by the e-commerce company to improve their marketing strategies and
customer engagement.
DECLARATION OF ORIGINALITY__________________________________________ iv
ACKNOWLEDGEMENTS ___________________________________________________ v
ABSTRACT ______________________________________________________________ vi
1 Introduction___________________________________________________________ 1
1.6 Contributions_____________________________________________________ 10
3.2 Performance Metric ______________________________________________________ 24
4 Experiment/Simulation _________________________________________________ 38
REFERENCES ___________________________________________________________ 64
Appendix _________________________________________________________________ 67
POSTER _________________________________________________________________ 96
Cluster analysis is a popular technique for customer segmentation, and it involves grouping
customers based on their similarity in terms of certain attributes. There are several clustering
algorithms that can be used for this purpose, including k-means, DBSCAN, and GMM. Each
algorithm has its own strengths and weaknesses, and the choice of algorithm depends on the
nature of the data and the research question.
RFM analysis is another popular technique for customer segmentation, and it involves
analyzing customers based on their purchasing behavior. The analysis involves three metrics:
recency, frequency, and monetary value, which are used to segment customers into different
groups based on their purchasing patterns.
In this report, we apply several clustering algorithms and RFM analysis to segment customers
of an e-commerce company based on their transactional and demographic data. The aim of
the study is to identify distinct customer groups with unique characteristics, and to provide
insights into their behavior. The results of this study could potentially be used by the e-
commerce company to improve their marketing strategies and customer engagement [7].
Behavioural: Status of user; Benefits desired; Readiness to buy
Psychographic: Opinions; Attitudes; Activities; Values; Interests
Geographic: Climate; Religion; Area size; Population density
Table 1.1 Types of Customer Segmentation
Therefore, a mix of attributes from different dimensions of segmentation is essential for
developing a robust customer segmentation model. Overall, the ultimate purposes of
customer segmentation are to maximize customer value to the company, optimize marketing
strategies, and improve customer experience.
1.1.2 Clustering
Clustering is a form of unsupervised learning strategy in the field of machine learning.
Unsupervised learning approach aims to discover intrinsic data patterns and actionable
insights from data sets that do not have a labelled output variable. It’s a type of exploratory
data analysis which involves grouping data sets into a specified number of clusters with
comparable features among the data points inside each cluster as shown in Figure 1.2 [9].
Clustering, also known as cluster analysis, is a vital data analysis technique and a field
within data mining. It involves grouping the dataset into clusters, where data points with
similar attributes are placed in the same group or cluster. The aim is to identify connections
between data points based on their raw data qualities. However, the main challenge is to
determine the appropriate number of clusters that are relevant and informative for analysis.
This is a systematic and iterative process that involves analyzing large volumes of raw data
for patterns and commonalities. The disorganized data is sifted for meaningful insights, and
then data points are assigned to clusters. To achieve good results in a market domain, a
particular clustering technique coupled with detailed examination of cluster properties based
on the clustering results may be ideal.
Effective customer segmentation is crucial for businesses seeking to maximize profits and
improve customer experiences. In this study, some alternatives were explored when it comes
to feature engineering, identifying significant variables, clustering algorithms implementation
and comparison and cluster interpretation. These steps are vital in the customer segmentation
pipeline. Addressing these challenges is essential for businesses to obtain accurate and
meaningful customer segmentation results, which can inform targeted marketing strategies,
improve customer engagement, and ultimately drive revenue growth [13].
Effective customer segmentation is a critical task for businesses, but selecting the appropriate
variables, determining the number of variables to consider, and determining the appropriate
sample size can be challenging. Improper selection of features, including irrelevant or
redundant features, can lead to inaccurate or meaningless cluster solutions, while an
inadequate sample size can lead to biased or incomplete results. Thus, there is a need to
identify best practices and guidelines for selecting variables, determining the number of
variables to consider, and determining the appropriate sample size to ensure accurate and
meaningful customer segmentation [15].
Data cleaning and preprocessing are critical steps in preparing data for analysis. However,
many organizations face challenges in these areas, including dealing with missing values,
inconsistent formatting, duplicate entries, and outliers. Additionally, selecting the appropriate
preprocessing techniques can be difficult, as different methods may have varying effects on
the quality and accuracy of the resulting analysis. As a result, there is a need for effective
data cleaning and preprocessing strategies that can ensure data integrity and optimize data for
analysis [13].
Clustering is an essential technique for customer segmentation and market analysis, but
selecting the appropriate clustering algorithm is a complex task. Marketers and data analysts
must choose from various clustering algorithms, each with its own strengths and weaknesses,
and evaluate their suitability for the specific data set and business problem at hand.
Determining the optimal number of clusters is a crucial step in clustering analysis. However,
it is a challenging task as it requires selecting the right method for calculating the number of
clusters and interpreting the results. The choice of the wrong method can lead to incorrect
conclusions, which can have significant consequences. Therefore, there is a need for a
systematic and reliable method for calculating the optimal number of clusters that takes into
account the characteristics of the data set and the clustering algorithm used. The problem is to
develop a method that can accurately determine the optimal number of clusters for a given
data set and clustering algorithm, considering the sample size, the number of variables, and
other relevant factors [15].
Effective validation of clustering results is essential to ensure that the clusters obtained from
the data are meaningful and useful. However, determining the optimal validation test for a
given clustering problem is challenging, as there are various validation measures and
methods available, and each has its own strengths and weaknesses. Furthermore, the
effectiveness of the chosen validation test may depend on the type of clustering algorithm
used, the characteristics of the data, and the desired outcome of the clustering analysis.
Therefore, identifying the appropriate validation test to use in each clustering problem is
crucial to ensure the reliability and validity of the results.
1.3 Motivation
Customer segmentation is a crucial aspect of any business that aims to better understand its
customers and tailor its products or services to meet their needs. However, selecting the
appropriate clustering algorithm for customer segmentation is a challenging task that requires
a thorough understanding of the strengths and weaknesses of different techniques. This report
aims to provide a comprehensive analysis of four popular clustering algorithms, namely K-
Means, DBSCAN, GMM, and RFM+k-means, in the context of customer segmentation. By
comparing the performance of these algorithms on a real-world dataset, this report will
provide valuable insights into their effectiveness in identifying meaningful customer
segments. Such insights can help businesses make more informed decisions about how to
allocate their resources and develop targeted marketing strategies that can improve customer
satisfaction and loyalty.
1.4 Objective
a) To compare the effectiveness of k-means, DBSCAN, GMM, and RFM clustering
algorithms in segmenting customers based on their purchasing behavior.
b) To determine the optimal number of clusters for each algorithm using internal
validation measures such as the silhouette score, AIC, BIC, Calinski-Harabasz index,
and Davies-Bouldin index.
c) To interpret and describe the resulting clusters in terms of customer characteristics,
preferences, and behaviors.
d) To provide recommendations for marketers and business owners on how to use
clustering analysis to improve customer targeting and retention strategies.
1.5 Project Scope and Direction
The scope of this project is to explore and compare four popular clustering algorithms for
customer segmentation: K-means, DBSCAN, GMM, and RFM. The project will involve
conducting a comprehensive literature review on each algorithm and its applications in
customer segmentation. The project will also involve implementing each algorithm on a
dataset of customer transactions and comparing the results obtained.
The direction of the project will be as follows:
a) Conduct a literature review of K-means, DBSCAN, GMM, and RFM algorithms and
their applications in customer segmentation.
b) Pre-process the customer transaction data, including data cleaning, missing value
imputation, and data normalization.
c) Implement each algorithm on the pre-processed dataset and obtain the corresponding
clusters.
d) Evaluate and compare the results obtained from each algorithm in terms of cluster
quality metrics such as Silhouette score, Dunn index, and Davies-Bouldin index.
e) Perform a detailed analysis of the cluster characteristics and interpret the results
obtained from each algorithm.
f) Discuss the strengths and weaknesses of each algorithm and provide
recommendations for their appropriate use in customer segmentation.
g) Summarize the findings of the project and provide future research directions.
h) Develop a user-friendly GUI using MiniBatchKMeans
The project aims to provide insights into the strengths and weaknesses of each algorithm and
their suitability for customer segmentation. The project will be useful for marketers and
business analysts who are interested in using clustering algorithms for customer
segmentation.
Furthermore, this report will provide practical guidance on selecting relevant variables,
determining optimal cluster numbers, validating cluster solutions, and interpreting cluster
descriptions. By bridging the gap between theoretical concepts and practical applications, this
report offers a valuable resource to those interested in leveraging customer segmentation and
clustering algorithms to improve customer targeting, enhance marketing campaigns, and
optimize overall business performance.
Overall, this study provides a comprehensive analysis and comparison of five clustering
algorithms, including MiniBatch K-means, offering valuable insights and practical guidance
for the field of customer segmentation.
This report is organized into six chapters, each addressing different aspects of the project.
Chapter 1, the Introduction, provides a comprehensive overview of the project. It covers the
background information, problem statement, project motivation, scope, objectives, project
contribution, highlights of project achievements, and the overall organization of the report.
This chapter sets the stage for the subsequent chapters, establishing the context and purpose
of the study.
Chapter 2 focuses on the Literature Review, where an extensive analysis of existing customer
algorithm techniques in the market is conducted. The aim is to evaluate and compare the
strengths and weaknesses of these techniques. This chapter provides valuable insights into the
current state of the field and serves as a foundation for the subsequent chapters.
In Chapter 3, the Methodology, the overall model design and methods employed in the
project are discussed. This chapter offers a detailed explanation of the chosen approach,
providing readers with a clear understanding of the methodologies and techniques utilized.
Chapter 4, the Experiment/Simulation chapter, centers around the hardware used and the
development of the graphical user interface (GUI) for the project. It outlines the technical
aspects of the implementation, highlighting the tools, equipment, and software utilized in the
experiments or simulations conducted.
Chapter 5 delves into Model Evaluation and Discussion. This chapter focuses on the analysis
of clusters and the examination of the clustering output. It involves a thorough evaluation and
discussion of the results obtained, allowing for a comprehensive assessment of the
performance and effectiveness of the clustering algorithms.
Finally, Chapter 6, the Conclusion and Recommendations chapter, provides a summary of the
project's findings. It also offers recommendations for future work and identifies potential
areas for further exploration and improvement in the field of customer segmentation using
clustering algorithms.
Customer segmentation has been a vital area of research in marketing and business analytics,
as it helps to identify different groups of customers with similar characteristics and behaviors
and create targeted marketing strategies to meet their specific needs. In this chapter, we
review previous works on customer segmentation techniques in the literature.
One of the most widely used customer segmentation techniques is the RFM (Recency, Frequency,
Monetary value) method, which is based on the principle that customers who have made
recent purchases, frequent purchases, and high-value purchases are more valuable to the
business. [1] used the RFM method to segment customers of an online retailer and found that
it was effective in identifying high-value customers and improving the business's overall
profitability.
In addition to the RFM method, clustering algorithms have also been widely used for
customer segmentation. [2] compared the performance of different clustering algorithms,
including K-means, SOM, and DBSCAN, on an e-commerce dataset. They found that the K-
means algorithm outperformed the others in terms of precision, recall, and F1-score.
[3] used K-means, GMM, and DBSCAN clustering algorithms to segment customers based
on their online shopping behavior. They found that the K-means algorithm was the most
accurate in identifying customer groups with distinct behavior patterns.
[5] used clustering techniques, including K-means, DBSCAN, and Spectral clustering, to
detect credit card fraud. They found that the Davies-Bouldin index was a reliable metric for
evaluating the performance of the clustering algorithms.
DBSCAN is another clustering algorithm that is often used for customer segmentation. This
algorithm groups together points that are close to each other and separates points that are
farther away. DBSCAN can handle datasets with irregular shapes and is not sensitive to
outliers. It has been shown to be effective in identifying dense regions in customer data,
which can be useful in identifying customer preferences and behaviors.
GMM is a probabilistic clustering algorithm that assumes that the data points are generated
from a mixture of Gaussian distributions. GMM can identify complex patterns in customer
data and can handle datasets with overlapping clusters. It has been shown to be effective in
identifying hidden customer segments, which may not be apparent from traditional
demographic data [1].
RFM clustering is a customer segmentation technique that is based on three key metrics: how
recently a customer has made a purchase, how frequently a customer makes purchases, and
how much money a customer spends (monetary value).
Overall, these clustering techniques have been widely used in customer segmentation and
have been shown to be effective in identifying distinct customer segments. Each technique
has its own strengths and weaknesses, and the choice of technique will depend on the specific
characteristics of the dataset and the research questions at hand.
For a marketer, the first and most fundamental constraint of cluster analysis is that you must
have access to relevant customer data. For instance, if we work for a service company, we
may have a reasonable client database that we can use to perform cluster analysis and
discover market categories. Larger companies are more likely to have access to relevant
marketing research survey data. However, most businesses, particularly smaller and newer
ones, will lack access to relevant data and hence will be unable to use cluster analysis.
Cluster analysis is also limited by the fact that it is merely a statistical technique that
presupposes no prior knowledge of the market or how consumers might react. In other words,
it's simply clustering data around a set of core points, which may or may not make sense after
the analysis is completed. The skill lies not in performing the analysis, but in
understanding and applying the results to identify appropriate market segments and then
developing a successful marketing strategy that targets one or more of these segments.
Besides, some cluster analysis methods produce somewhat different results each time the
statistical analysis is done. This can arise because there is no one-size-fits-all method to data
analysis. Some type of random or arbitrary approach to "guessing" the possible locations
(means) of the various data centers (that is, market segments) is chosen at the start, especially
when there are many variables to consider. As a result, the results can vary based on the
original starting point (seed) of the data for each of the segments. Even if you use the same
statistical program to run the same data set, this can happen, since the underlying statistical
technique may utilize a random starting point [7].
Despite the promising results of the previous studies, there are several limitations that need to
be addressed. First, most of the studies used a limited number of clustering algorithms, which
might not be suitable for all types of datasets. Second, some studies did not include domain
knowledge in the clustering process, which might result in less effective segmentation. Third,
some studies used a limited number of performance metrics, which might not fully capture
the effectiveness of the segmentation. Finally, some studies focused on a specific industry or
dataset, which might limit the generalizability of the findings to other contexts.
These limitations highlight the need for a more comprehensive and systematic approach to
customer segmentation using clustering algorithms. In this study, we aim to address these
limitations by comparing multiple clustering algorithms, incorporating domain knowledge,
and using a variety of performance metrics [8].
The following code was used to import and load the dataset into a pandas dataframe:
#Imports
import numpy as np
import pandas as pd
The dataset has 29 columns and 2,240 rows. The columns and their respective data types are
listed in Figure 3.1:
In our study, data cleaning was performed on the 'marketing_campaign.csv' dataset, which
was imported using pandas. First, missing data were checked using the isnull() function in
pandas. There were a total of 24 missing values in the 'Income' feature.
To handle the missing data, we decided to impute the missing values using the MICE
imputation method, which replaces missing data with predicted values based on other
variables in the dataset. The mice() function from the impyute library was used to impute the
missing values. The IterativeImputer (MICE) performs multiple regressions over random
samples of the data, then takes the average of the multiple regression values and uses that
value to impute the missing value. MICE was selected because it is a multivariate imputation
algorithm, which uses the entire set of available feature dimensions to estimate the missing
values (e.g. IterativeImputer); by contrast, univariate imputation algorithms impute values in
the i-th feature dimension using only non-missing values in that feature dimension (e.g.
SimpleImputer). The MICE code is shown in the Appendix.
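As a minimal sketch (the project's exact code is in the Appendix), the same MICE-style imputation can be expressed with scikit-learn's IterativeImputer; the numeric-column selection here is an assumption:
import pandas as pd
# the experimental import is required to enable IterativeImputer in scikit-learn
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv('marketing_campaign.csv', delimiter='\t')
print(df['Income'].isnull().sum())  # 24 missing values, as reported above

# MICE-style multivariate imputation: each feature with missing values is
# regressed on the other features over several rounds
numeric_cols = df.select_dtypes(include='number').columns
imputer = IterativeImputer(max_iter=10, random_state=42)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])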
After handling missing values and duplicated features, the next step was to perform the
following cleaning steps (a short sketch of steps (a)-(c) follows this list):
a) Restructuring datatypes: the 'Dt_Customer' column was converted to a datetime
format using the pd.to_datetime() function, and the age of each customer was
calculated by subtracting the birth year from the current year.
b) Combining features: a new feature called 'Total_AcceptedCampaign' was created as
the sum of all accepted campaigns and responses.
c) Calculating important metrics: the days enrolled were calculated by subtracting the
last enrollment date from the 'Dt_Customer' column, and the total spending of each
customer was calculated by adding up the spending on different products.
Additionally, the 'Marital_Status', 'Kidhome', and 'Teenhome' columns were
combined to create a new 'Familysize' feature.
d) Removing unused features: columns that were not relevant to the analysis were
removed, such as 'Year_Birth', 'MntWines', 'MntFruits', 'MntMeatProducts',
'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Kidhome', and 'Teenhome'.
Columns related to the accepted campaigns, complaints, and purchase types were also
dropped.
These cleaning steps were performed using pandas functions such as drop(), sum(),
replace(), astype(), and fillna(). The cleaned dataset was saved as a new CSV file for
further analysis.
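A minimal sketch of steps (a)-(c), assuming the standard column names of this dataset and 2023 as the reference year:
# (a) restructure datatypes
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
df['Age'] = 2023 - df['Year_Birth']  # reference year is an assumption

# (b) combine the campaign acceptance columns into one feature
campaign_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
df['Total_AcceptedCampaign'] = df[campaign_cols].sum(axis=1)

# (c) days enrolled, relative to the most recent enrollment date
df['Days_Enrolled'] = (df['Dt_Customer'].max() - df['Dt_Customer']).dt.days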
Finally, the data was checked to be in the correct format for analysis. The 'date' column was
converted to a datetime format using the to_datetime() function in pandas, and the categorical
columns were encoded using label encoding from the sklearn.preprocessing library.
Overall, the data cleaning process helped to ensure that the dataset was ready for analysis and
that any errors or inconsistencies in the data were handled appropriately.
After the feature extraction was handled above, outliers in the numerical columns were
checked using boxplots and histograms. Outliers were found in the 'Age' and 'Income'
columns, which were removed using the IQR (Interquartile Range) method as shown in
Figure 3.3. The removal of outliers was justified as extreme values outside the range of
typical values for these features were observed, which can occur due to measurement errors,
data entry errors, or other anomalies in the data. Since outliers can have a significant impact
on the model's performance, it was decided to remove them using the commonly used IQR
method. By removing these outliers, the model's accuracy and reliability were improved.
Figure 3.4 shows that the total number of records after removing the outliers is 2212; thus,
the final number of records to segment after cleaning and engineering was 2212 rows.
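A minimal sketch of the IQR rule applied here, assuming the conventional 1.5 x IQR fences:
def remove_iqr_outliers(frame, cols, k=1.5):
    # keep only rows whose values fall within [Q1 - k*IQR, Q3 + k*IQR]
    for col in cols:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        frame = frame[frame[col].between(q1 - k * iqr, q3 + k * iqr)]
    return frame

df = remove_iqr_outliers(df, ['Age', 'Income'])
print(len(df))  # 2212 rows remained after removal, per Figure 3.4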
The data cleaning and feature engineering steps resulted in a cleaned and optimized dataset
that can be used for customer segmentation analysis. By removing outliers and dropping low
variance features, the dataset became more accurate and reliable for analysis. The new
features created were also useful for uncovering insights and trends in the data.
Overall, the data cleaning and feature engineering steps are critical in preparing the data for
analysis and ensuring that the results obtained are trustworthy and valuable.
Scaling and encoding the data was the last step before fitting any machine learning model;
after this, the data was ready for any clustering algorithm.
It is important to note that in this project, PCA was only used for visualization purposes and
not for feature selection. By reducing the dimensionality of the feature space, the data could
be more easily visualized and explored. However, the original features were retained for
modelling and analysis purposes. By utilizing PCA for visualization, the data was able to be
presented in a more clear and concise manner.
In this analysis, PCA was performed on the preprocessed dataset to reduce the feature space
while retaining as much information as possible. The number of principal components was
chosen based on the amount of variance explained by each component and the cumulative
variance explained. In Figure 3.5, the dimensionality of the data was reduced from the
original 29 features to just 3 principal components using PCA. The resulting scatter plot
shows that the data points are now more tightly clustered and appear to form two distinct
groups. This suggests that the two clusters identified in the k-means clustering algorithm may
indeed be representative of underlying subgroups in the data. Overall, this analysis using
PCA provides a useful visualization of the data and supports the findings of the k-means
clustering analysis.
After running PCA, the dataset was reduced to n components, and these components were
used as input features for the subsequent modeling steps. The resulting PCA components
were then examined to identify the most important variables in each component and their
contributions to the overall variance in the dataset.
Overall, the use of PCA for dimensionality reduction resulted in a more manageable feature
space, while still retaining much of the original information present in the dataset.
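A sketch of the visualization-only PCA step, assuming df_scaled holds the standardized features:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=3)
components = pca.fit_transform(df_scaled)
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance retained

# 3-D scatter of the first three principal components
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(components[:, 0], components[:, 1], components[:, 2], s=8)
plt.show()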
The internal validation metrics for clustering, such as those mentioned above (AIC, BIC, CH
Index, DB Index, and Silhouette Score), are used to evaluate the quality of the clustering
structure without reference to external information or a ground truth. These metrics assess
how similar the data points within each cluster are to each other and how dissimilar they are
to the points in other clusters.
While external validation metrics provide a more objective evaluation of clustering quality,
they are often not feasible in practice because the true class labels or external information
may not be available. Additionally, even when external information is available, it may not
necessarily be the best representation of the true structure of the data, which makes internal
validation metrics more generalizable.
Therefore, internal validation metrics are often preferred in practice for evaluating clustering
algorithms, since they can provide insight into the clustering structure and its quality even
when external information is not available. Hence, this study used only internal validation
metrics, because the dataset has no ground truth.
These indices are commonly used for evaluating the quality of clustering results. The AIC
and BIC indices are used for model selection, where a lower value indicates a better model
fit. The Calinski-Harabasz index is used for evaluating cluster quality, where a higher value
indicates better clustering performance, while for the Davies-Bouldin index a lower value
indicates better clustering performance.
The Silhouette Score is another commonly used metric for evaluating the quality of clustering
results. The score measures how similar an object is to its own cluster compared to other
clusters. The score ranges from -1 to 1, where a score closer to 1 indicates better clustering
performance.
The optimal number of clusters was determined by evaluating several performance metrics,
including the Davies-Bouldin index, silhouette score, AIC, BIC, and Calinski-Harabasz score,
for a range of values of k from 2 to 10. The results for each metric were plotted and analyzed.
Based on the analysis, it was found that the optimal number of clusters, k, was 2, as this
value achieved the best metric scores. Other metrics were also considered,
and they were consistent with this result. The selection of the optimal number of clusters was
based on a careful evaluation of multiple performance metrics to arrive at a robust and
reliable solution.
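A sketch of this scan, where the AIC/BIC values come from GMM fits and the remaining metrics from k-means labels (df_scaled is an assumed name for the standardized data):
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df_scaled)
    gmm = GaussianMixture(n_components=k, random_state=42).fit(df_scaled)
    print(k,
          silhouette_score(df_scaled, labels),
          calinski_harabasz_score(df_scaled, labels),
          davies_bouldin_score(df_scaled, labels),
          gmm.aic(df_scaled), gmm.bic(df_scaled))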
GMM is a probabilistic model that assumes that the data is generated from a mixture of
Gaussian distributions. It is a soft clustering algorithm, which means that instead of assigning
each data point to a single cluster, it assigns probabilities to each data point belonging to each
of the clusters. The algorithm uses an EM algorithm to iteratively estimate the parameters of
the Gaussian distributions and the probabilities of the data points belonging to each cluster.
DBSCAN is a density-based clustering algorithm that groups together data points that are
closely packed together (high density) and separates out data points that are far apart (low
density). It works by defining a radius around each data point and grouping together data
points that fall within that radius. The algorithm can handle noise and outliers by classifying
them as not belonging to any cluster.
To implement the algorithm, the data was first standardized using z-score normalization to
ensure that all variables had the same scale. The k-means clustering was performed using the
scikit-learn library with different values of k and random initializations. To determine the
optimal number of clusters, several metrics such as the Davies-Bouldin index, silhouette
score, AIC, BIC, and Calinski-Harabasz score were evaluated. These metrics allowed us to
assess the quality of the resulting clusters and determine the optimal number of clusters.
To mitigate the sensitivity of the k-means algorithm to initial centroid placement, the
algorithm was run multiple times with different initializations, and the solution with the
lowest sum of squared distances was chosen. In addition, other clustering algorithms such as
GMM clustering and DBSCAN were explored to compare their performance with k-means
clustering.
K-means clustering is a useful tool for exploratory data analysis and pattern recognition, and
it can be applied to a wide range of applications such as market segmentation, image
processing, and anomaly detection. However, it is important to carefully choose the number
of clusters and evaluate the quality of the resulting clusters to avoid suboptimal results. The
pipeline was as below (a code sketch follows the list):
1. Standardize the data using z-score normalization
2. Choose the number of clusters
3. Initialize centroids randomly
4. Assign data points to the nearest centroid
5. Recalculate centroids as the mean of all data points assigned to each cluster
6. Repeat until convergence
7. Evaluate the quality of the resulting clusters using metrics such as the Davies-Bouldin
index, silhouette score, AIC, BIC, and Calinski-Harabasz score
8. Choose the solution with the lowest sum of squared distances
9. Compare the performance of k-means clustering with other clustering algorithms such
as GMM and DBSCAN
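A sketch of steps 1-8 of this pipeline, assuming num_cols names the numeric feature columns:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = StandardScaler().fit_transform(df[num_cols])       # step 1
km = KMeans(n_clusters=2, n_init=10, random_state=42)  # steps 2-6; n_init reruns cover step 8
labels = km.fit_predict(X)
print(km.inertia_)  # sum of squared distances to the nearest centroid
print(silhouette_score(X, labels), davies_bouldin_score(X, labels))  # step 7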
GMM clustering involves finding the optimal parameters for the Gaussian distributions that
best fit the data. This is typically done using the Expectation-Maximization (EM) algorithm,
which is an iterative algorithm that alternates between computing the expected probability of
each data point belonging to each cluster (the "E" step) and updating the parameters of the
Gaussian distributions based on these probabilities (the "M" step). The EM algorithm
continues until convergence, which is usually defined as a small change in the likelihood
function or the parameters.
To perform GMM clustering, the number of clusters (k) and the initialization method for the
parameters must first be determined. The GMM model is then fitted to the data using the EM
algorithm to obtain the cluster assignments and the parameters of each Gaussian distribution.
These cluster assignments can then be used to label new data points and analyze the
characteristics of each cluster.
Like k-means clustering, various performance metrics can be used to evaluate the quality of
the GMM clustering results. These include the Davies-Bouldin index, silhouette score, AIC,
BIC, and Calinski-Harabasz score. The results of GMM clustering can then be compared to
those of k-means clustering and other clustering algorithms to determine the most appropriate
method for the dataset.
Overall, GMM clustering is a powerful tool for identifying clusters in data with complex
distributions and can provide additional insights beyond what is possible with simpler
clustering algorithms such as k-means.
In this project, the data was first standardized using z-score normalization to ensure that all
variables had the same scale. Then, the GMM model was fitted to the normalized data using
the expectation-maximization algorithm with a full covariance matrix and random state of 42.
To determine the optimal number of clusters, the Bayesian Information Criterion (BIC) and
the Akaike Information Criterion (AIC) were evaluated for a range of values of k, the number
of clusters. The BIC and AIC measures indicate the trade-off between the goodness of fit and
the complexity of the model, with lower values indicating a better fit with fewer parameters.
Based on the analysis, k was determined to be 2 as it had the lowest BIC and AIC scores. The
quality of the clusters was also assessed using the silhouette score and the Davies-Bouldin
index. The results indicated that the identified clusters were well-separated and distinct.
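A sketch of the GMM fit and AIC/BIC scan described above (X stands for the z-score-normalized data):
from sklearn.mixture import GaussianMixture

for k in range(2, 11):
    g = GaussianMixture(n_components=k, covariance_type='full', random_state=42).fit(X)
    print(k, g.aic(X), g.bic(X))  # lower is better; k=2 scored lowest in this study

best = GaussianMixture(n_components=2, covariance_type='full', random_state=42).fit(X)
gmm_labels = best.predict(X)       # hard cluster assignments
gmm_probs = best.predict_proba(X)  # soft membership probabilities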
The following figure shows the output for GMM in this study:
In this study, the DBSCAN algorithm was configured by setting the min_samples
hyperparameter to twice the number of feature columns. This was done to increase the
minimum number of data points required to form a cluster and ensure that the clusters were
well-defined.
To determine the optimal value for the eps hyperparameter, GridSearchCV, a cross-validation
method that systematically searches for the best combination of hyperparameters, was
utilized. A range of values for eps was tested, and the results were evaluated using the
silhouette score and the Davies-Bouldin index. The silhouette score measures the similarity
of a data point to its own cluster compared to other clusters, while the Davies-Bouldin index
measures the average similarity between each cluster and its most similar cluster. Lower
values of the Davies-Bouldin index indicate better clustering performance.
After hyperparameter tuning, the optimal value found for eps was 0.1. The value for the
modified min_samples hyperparameter was also optimized to ensure that each cluster had a
minimum number of data points; the optimal value for the modified min_samples
hyperparameter was 18.
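A sketch of the tuned DBSCAN fit with the reported hyperparameters:
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.1, min_samples=18, metric='euclidean')
db_labels = db.fit_predict(X)  # X: standardized features (assumed name)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)  # label -1 marks noise
n_noise = (db_labels == -1).sum()
print(n_clusters, n_noise)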
Finally, the clusters were visualized using a scatter plot of the data points colored by their
assigned cluster label. This visualization gives insight into the characteristics of each cluster
and their relationships to each other.
The following figure shows the output for DBSCAN in this study:
The k-means algorithm, a popular unsupervised learning technique, was used to cluster the
customers based on their RFM scores. The RFM values were first standardized using z-score
normalization to ensure that all variables had the same scale. The optimal number of clusters
to use for k-means clustering was determined using the elbow method. Different values of k
were tested, and the resulting within-cluster sum of squares (WCSS) was evaluated to
identify the k value that minimized WCSS while still preserving meaningful clusters.
Other clustering algorithms such as GMM and DBSCAN were explored to compare their
performance with k-means clustering. However, k-means clustering was found to provide the
best balance of performance and interpretability for the RFM analysis.
In this study, the Recency feature already exists in the dataset, while the features most
suitable as Frequency and Monetary were the total purchases and total spending. The RFM
algorithm was therefore implemented separately from the other algorithms above, as it
required different data pre-processing, such as RFM mapping and distribution checks (a
sketch follows below).
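A sketch of the RFM construction and the k-means step, using this dataset's column names:
rfm = pd.DataFrame({
    'Recency': df['Recency'],
    'Frequency': df[['NumWebPurchases', 'NumCatalogPurchases',
                     'NumStorePurchases']].sum(axis=1),
    'Monetary': df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                    'MntSweetProducts', 'MntGoldProds']].sum(axis=1),
})
rfm_scaled = StandardScaler().fit_transform(rfm)

# elbow method: within-cluster sum of squares (inertia) for each candidate k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(rfm_scaled).inertia_
        for k in range(2, 11)]
rfm['Cluster'] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(rfm_scaled)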
The following figure shows the output for RFM in this study:
4.2 Software
Jupyter Notebook was chosen for this project since it is a popular integrated
development environment (IDE) for Python that offers several benefits, including:
Interactive Computing: Jupyter Notebook allows you to write and execute code
snippets in an interactive way. This means that you can run code cell by cell, making
it easy to test and debug code.
Data Visualization: Jupyter Notebook comes with built-in visualization tools that
enable you to create and display charts, graphs, and other visualizations of your data
directly in the notebook.
Markdown Support: Jupyter Notebook supports Markdown, which allows you to add
formatted text, links, and images to your notebook. This makes it easy to add
documentation, comments, and explanations to your code.
The Anaconda Prompt provides an easy way to launch the graphical user interface (GUI) of
Python applications. To run a Python GUI application using the Anaconda Prompt, simply
navigate to the directory containing the application's code and run the command to launch the
GUI. For example, a Python script containing a GUI can be launched from this prompt with
the command "python <script_name>.py".
Overall, the Anaconda Prompt is a convenient tool for running Python code and managing
Anaconda environments, and it can help streamline the process of launching Python GUI
applications.
The traditional k-means algorithm was implemented using the KMeans class from scikit-
learn, with 4 clusters and the default initialization parameters. The incremental k-means
algorithm was implemented using the IncrementalKMeans class, with a batch size of 100 and
a maximum of 10 iterations. The mini-batch k-means algorithm was implemented using the
MiniBatchKMeans class, with a batch size of 100 and 1000 maximum iterations.
Incremental k-means updates the clustering model with new data points one at a time, and
adapts the centroids of the existing clusters accordingly. This allows the model to learn from
new data without having to reprocess the entire dataset, which can be time-consuming for
large datasets. However, incremental k-means may be less accurate than traditional k-means
due to the smaller batch size.
Mini-batch k-means, on the other hand, uses random subsets (or "mini-batches") of the data
to update the clustering model, instead of processing the entire dataset at once. This can be
much faster than incremental k-means, as well as traditional k-means, because the algorithm
only needs to process a small batch of data at a time. Mini-batch k-means can also be more
accurate than incremental k-means, because it uses a larger batch size to update the centroids.
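A sketch contrasting the variants (the IncrementalKMeans class is project-specific; scikit-learn's MiniBatchKMeans with partial_fit is shown for the incremental, batch-at-a-time updates):
from sklearn.cluster import KMeans, MiniBatchKMeans

km = KMeans(n_clusters=4, random_state=42).fit(X)  # traditional, full-batch
mbk = MiniBatchKMeans(n_clusters=4, batch_size=100, max_iter=1000,
                      random_state=42).fit(X)      # mini-batch updates

# incremental-style update with a new batch of 100 points, without refitting everything
mbk.partial_fit(X[:100])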
The GUI consists of three input fields for the recency, frequency, and monetary values,
respectively, as well as a "Predict" button and a "Plot" button. When the user inputs the
values and clicks the "Predict" button, the GUI uses the MiniBatchKMeans class from the
scikit-learn library to update the clustering model with the new data point and predict its
cluster label. The resulting label is displayed to the user in a text label in the result section.
Furthermore, CRUD functionality was also implemented: users can add new customer data,
delete new or old customer data, and predict the cluster results. The read function was
designed as a table, shown in Figure , where users can directly select customers in the table
to predict their cluster label or delete their data.
Overall, the GUI provides a user-friendly interface for interacting with the RFM clustering
model, allowing users to input new data points and explore the clustering results in an
intuitive way.
In this study, the different approaches were kept in separate files to simplify determining the
number of clusters and to make the project easier to manage. Three files were included in
this study: "Customer segmentation using k-means, GMM, and DBSCAN clustering",
"Customer segmentation using RFM clustering", and "Visualization after data-preprocessing".
Various clustering algorithms were used in this project, as mentioned in Chapter 3. The RFM
algorithm was placed in a separate file because it is a behaviour-based segmentation, in
contrast to the other algorithms.
Figure 5.1 shows that the elbow occurs at k=2, where the Calinski-Harabasz score was
5076.938 and the silhouette score was 0.600.
5.1.2 RFM+kmeans
There were two metrics evaluation used which were elbow and silhouette score:
The optimal number of clusters based on the elbow method is 4, while according to the
silhouette score it is 2. Since the silhouette score considers intra- and inter-cluster distances, it
might produce more separated clusters. However, two clusters might be too general for
customer segmentation, which requires more specific clusters to build personalized offers.
Thus, k=4 was used to perform the clustering [1].
To identify the optimal number of clusters in our customer segmentation analysis, several
evaluation metrics were used to assess the performance of the different clustering algorithms.
Specifically, we looked for the number of clusters that best satisfied the following criteria: a
lower value of AIC or BIC, a higher value of the Calinski-Harabasz score, a Silhouette score
approaching 1, and a lower Davies-Bouldin index score. A lower value of AIC or BIC
indicates a better fit for the model, while a higher value of the Calinski-Harabasz score
indicates better clustering performance. The Silhouette score, which ranges from -1 to 1, can
be used to evaluate the quality of clustering, with a score approaching 1 being desirable. The
Davies-Bouldin index, which has a minimum of 0, can be used to evaluate the compactness
and separation of clusters, with a lower score indicating better clustering performance. Using
these evaluation metrics, we were able to identify the optimal number of clusters for the
customer segmentation analysis and gain insights into the different customer segments and
their characteristics.
5.2.2 DBSCAN
The choice of clustering metrics differs between DBSCAN and k-means/GMM due to the
unique hyperparameters of DBSCAN, namely ε and min_samples, which are distinct from
the traditional number of clusters or components. In Figure 5.6, the clustering results were
analyzed using GridSearchCV() based on the Silhouette Score, resulting in an optimal ε value
of 0.1. Furthermore, Figure 5.7 highlights the best combination of ε and min_samples as 0.3
and 4, respectively. Consequently, the latter values were selected to achieve more accurate
clustering results with DBSCAN compared to the previous settings.
The results show a total of 999 customers in cluster 1 and 1213 customers in cluster 2;
hence, cluster 2 contains more customers than cluster 1.
Figure 5.16 shows that the number of customers in cluster 1 is the highest, while the number
of customers in cluster 2 is the lowest.
Based on Figure 5.17, the clusters separate the values of RFM into two groups, low and high,
as observed from the boxplot and centroid analysis.
The Recency values are categorized as follows: Low (22-23 days) and High (73 days). The
Frequency values are divided into Low (7 purchases) and High (19 purchases). The Monetary
values are likewise split into low and high groups.
Cluster 0 represents the group of Low-Spending Active Customers with low recency,
low frequency, and low monetary values. This segment comprises 619 customers,
which accounts for 28% of the total customer base.
Cluster 2 identifies the Best Active Customers with low recency, high frequency, and
high monetary values. These customers are highly valuable to the company, as they
make frequent high-value purchases. However, their number is relatively small,
comprising only 466 customers or 21% of the total customer base.
Cluster 3 represents the Churned Best Customers, characterized by high recency, high
frequency, and high monetary values. Although these customers have contributed
significantly to the company's revenue, their last purchase occurred a long time ago,
potentially indicating churn. This cluster consists of 524 customers, equivalent to 23%
of the total customer base.
Based on Figure 5.18, it is evident that Cluster 2 stands out as the best customer segment
when compared to the other clusters. This is primarily due to their lower recency, higher
frequency, and higher monetary values. Specifically, customers in this cluster tend to visit the
company for shorter periods but exhibit higher visit frequency and spending patterns. These
customers may visit the company multiple times within a week and spend significantly more
on their purchases.
On the other hand, customers in Cluster 3 demonstrate higher values in terms of recency,
frequency, and monetary factors, indicating that they are highly valuable customers. They
make substantial purchases and spend more, but their visit intervals are longer, suggesting a
longer gap between their visits to the company.
In contrast, customers in Cluster 0 exhibit considerably lower values across all three
dimensions—recency, frequency, and monetary. This cluster represents the worst-performing
group within the dataset, as these customers display lower levels of engagement, visit the
company less frequently, and make smaller purchases compared to the other clusters.
Initially, after preprocessing the data, K-means clustering was employed, which is a widely
used method for partitioning data into groups. It determined clusters based on the proximity
of customers in the feature space. However, K-means assumes spherical-shaped clusters with
equal variances, which may not always hold true in real-world scenarios.
Lastly, RFM analysis was employed, a valuable technique for customer segmentation based
on recency, frequency, and monetary value. RFM provided insights into customer
transactional behavior, allowing for the classification of customers into distinct segments
with unique characteristics and marketing implications.
Throughout the analysis, each clustering technique demonstrated strengths and limitations. K-
means offered a straightforward and interpretable approach but struggled with complex,
non-spherical cluster structures.
In addition to the clustering techniques, a user-friendly GUI was developed based on RFM
analysis and the Mini-Batch K-means algorithm. This GUI provides businesses with an
intuitive platform to perform customer segmentation on their own datasets. By incorporating
RFM analysis, the GUI enables users to analyze customer transactional behavior and identify
valuable customer segments. The integration of the Mini-Batch K-means algorithm ensures
efficient and scalable clustering, making it suitable for large datasets. The GUI's
visualizations and interactive features enhance the interpretation and exploration of the
segmentation results, empowering businesses to make data-driven decisions.
[2] Kumar, A., Jain, A., Jain, S., & Jain, S. (2016). Comparative study of clustering
algorithms for customer segmentation in e-commerce. In 2016 International Conference on
Computing, Analytics and Security Trends (CAST) (pp. 300-305). IEEE. DOI:
10.1109/CAST.2016.79
[3] Xiang, Y., & Gong, Y. (2018). Online Shopping Behavior Analysis Based on K-means,
GMM and DBSCAN Clustering Algorithm. In 2018 International Conference on
Computational Science and Computational Intelligence (CSCI) (pp. 1122-1126). IEEE. DOI:
10.1109/CSCI46756.2018.00209
[4] Chen, X., Zuo, X., Wu, Z., & Liu, X. (2020). A Comparative Study of Customer
Segmentation Methods Based on Online Shopping Behavior. In 2020 IEEE International
Conference on Industrial Engineering and Engineering Management (IEEM) (pp. 791-795).
IEEE. DOI: 10.1109/IEEM47687.2020.9378743
[5] Jadhav, M., & Sonawane, K. (2021). Credit card fraud detection using clustering
techniques. In 2021 11th International Conference on Cloud Computing, Data Science &
Engineering (Confluence) (pp. 1-6). IEEE. DOI:
10.1109/CONFLUENCE51715.2021.9461624
[6] Yeh, Y.-L., & Huang, C.-C. (2018). Customer segmentation of bicycle-sharing users
based on their usage behavior. Sustainability, 10(5), 1579. DOI: 10.3390/su10051579
[7] Allenby, G., Fennell, G., Bemmaor, A., Bhargava, V., Christen, F., Dawley, J., Dickson,
P., Edwards, Y., Garratt, M., Ginter, J., Sawyer, A., Staelin R. & Yang, S. (2002) Market
segmentation research: beyond within and across group differences. Marketing Letters, 13, 3,
pp. 233–243.
[9] Dolničar, S. & Leisch, F. (2004) Segmenting markets by bagged clustering. Australasian
Marketing Journal, 12, 2, pp. 51–65.
[10] Dimitriadou, E., Dolničar, S. & Weingessel, A. (2002) An examination of the number of
indexes for determining the number of clusters in binary data sets. Psychometrika, 67, 2, pp.
137–160.
[11] Decker, R., Wagner, R. & Scholz, S.W. (2005) Growing clustering algorithms in market
segmentation: defining target groups and related marketing communication. In H.-H. Bock,
W. Gaul & M. Vicki (eds) Data Analysis, Classification and the Forward Search. Berlin:
Springer, pp. 23–30.
Online Article:
[12] Qualtrics, "Customer Segmentation: Definition & Methods", Qualtrics AU, 2020.
[Online]. Available: [Link] segmentation/?rid=ip&prevsite=en&newsite=au&geo=MY&geomatch=au.
[13] D. Gong, "Clustering Algorithm for Customer Segmentation", Medium, 2021. [Online].
Available: [Link] e2d79e28cbc3.
Here is an example of the code used for some of the cleaning steps:
# change date format
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')

# total spendings
df['Spend'] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                  'MntSweetProducts', 'MntGoldProds']].sum(axis=1)

# family size: map marital status to 1 (single) or 2 (relationship), then add children
df['Familysize'] = df['Marital_Status'].replace(
    {"Married": "Relationship", "Together": "Relationship",
     "Widow": "Single", "Divorced": "Single", "Single": "Single",
     "Alone": "Single", "Absurd": "Single", "YOLO": "Single"}
).replace({'Single': 1, 'Relationship': 2}).fillna(0).astype(int) \
    + df['Kidhome'] + df['Teenhome']

# total purchases
df['Total_Purchases'] = df[['NumDealsPurchases', 'NumWebPurchases',
                            'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)

# scaling and encoding (imports added for completeness; int_list / obj_col are
# the lists of numeric / object columns defined earlier)
from sklearn.preprocessing import StandardScaler, LabelEncoder

autoscaler = StandardScaler()
df[int_list] = autoscaler.fit_transform(df[int_list])

label_encoder = LabelEncoder()
for col in obj_col:
    df[col] = label_encoder.fit_transform(df[col])
K-means Implementation:
# k-means clustering and cluster plot
# (the plotly references below were reconstructed as px/go calls; the original
# module paths were lost in extraction)
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_preds = kmeans.fit_predict(df_norm)

plot_km = df
plot_km['KMeans Cluster'] = kmeans_preds
plot_km = plot_km.sort_values(by='KMeans Cluster')
plot_km['KMeans Cluster'] = plot_km['KMeans Cluster'].astype(str).apply(
    lambda x: 'Outliers' if x == '-1' else x)

# # Reverse normalization
# plot_km['Income'] = plot_km['Income'] * 100
# plot_km['Spend'] = plot_km['Spend'] * 1000

# Plot of clusters
temp = go.layout.Template()
temp.layout.plot_bgcolor = '#F9F9F9'
temp.layout.paper_bgcolor = '#F9F9F9'
temp.layout.font.color = '#4D4D4D'
temp.layout.title.font.color = '#4D4D4D'

fig = px.scatter(plot_km, x="Income", y="Spend", color="KMeans Cluster",
                 color_discrete_sequence=px.colors.qualitative.T10[2:])
fig.update_traces(marker=dict(size=11, opacity=0.85,
                              line=dict(width=1, color='#F7F7F7')))
fig.update_layout(template=temp,
                  title="KMeans Cluster Profiles,<br>Customer Spending vs. Income",
                  width=700, legend_title='Cluster',
                  paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22,
                  xaxis=dict(title='Income, $', ticksuffix='k', showline=True,
                             zeroline=False, range=[0, None]),
                  yaxis=dict(title='Spending', showline=True, range=[0, None]))
fig.show()
GMM Implementation:
# Gaussian Mixture Model clustering
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(df_norm)
gmm_preds = gmm.predict(df_norm)

plot_gmm = df.copy()
plot_gmm['GMM Cluster'] = gmm_preds
plot_gmm = plot_gmm.sort_values(by='GMM Cluster')
plot_gmm['GMM Cluster'] = plot_gmm['GMM Cluster'].astype(str).apply(
    lambda x: 'Outliers' if x == '-1' else x)

# Plot of clusters
temp = go.layout.Template()
temp.layout.plot_bgcolor = '#F9F9F9'
temp.layout.paper_bgcolor = '#F9F9F9'
temp.layout.font.color = '#4D4D4D'
temp.layout.title.font.color = '#4D4D4D'
fig = px.scatter(plot_gmm, x="Income", y="Spend", color="GMM Cluster",
                 color_discrete_sequence=px.colors.qualitative.T10[2:])
fig.update_traces(marker=dict(size=11, opacity=0.85, line=dict(width=1, color='#F7F7F7')))
fig.update_layout(template=temp,
                  title="Gaussian Mixture Model Cluster Profiles,<br>Customer Spending vs. Income",
                  width=700, legend_title='Cluster',
                  paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22,
                  xaxis=dict(title='Income, $', ticksuffix='k', showline=True, zeroline=False),
                  yaxis=dict(title='Spending', showline=True))
fig.show()
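The component count n_components=2 mirrors the k selected for k-means. As a sketch (not the selection procedure used in the report), the Bayesian Information Criterion is a standard cross-check for the GMM component count, assuming df_norm:

# Sketch: pick the GMM component count by BIC (lower is better)
from sklearn.mixture import GaussianMixture

bics = {n: GaussianMixture(n_components=n, covariance_type='full',
                           random_state=42).fit(df_norm).bic(df_norm)
        for n in range(2, 9)}
best_n = min(bics, key=bics.get)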
DBSCAN Implementation:
# DBSCAN clustering
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=4, metric='euclidean')
db_preds = db.fit_predict(df_norm)

plot_db = df.copy()
plot_db['DB Cluster'] = db_preds
plot_db = plot_db.sort_values(by='DB Cluster')
# DBSCAN labels noise points -1; rename them 'Outliers' for the legend
plot_db['DB Cluster'] = plot_db['DB Cluster'].astype(str).apply(
    lambda x: 'Outliers' if x == '-1' else x)

# Plot of clusters
temp = go.layout.Template()
temp.layout.plot_bgcolor = '#F9F9F9'
temp.layout.paper_bgcolor = '#F9F9F9'
temp.layout.font.color = '#4D4D4D'
temp.layout.title.font.color = '#4D4D4D'
fig = px.scatter(plot_db, x="Income", y="Spend", color="DB Cluster",
                 color_discrete_sequence=px.colors.qualitative.T10[2:])
fig.update_traces(marker=dict(size=11, opacity=0.85, line=dict(width=1, color='#F7F7F7')))
fig.update_layout(template=temp,
                  title="DBSCAN Cluster Profiles,<br>Customer Spending vs. Income",
                  width=700, legend_title='Cluster',
                  paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22,
                  xaxis=dict(title='Income, $', ticksuffix='k', showline=True, zeroline=False),
                  yaxis=dict(title='Spending', showline=True))
fig.show()
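As the fortnightly logs note, DBSCAN's epsilon is difficult to set by hand. A common heuristic (a sketch, not the tuning actually used above) is the k-distance curve: sort each point's distance to its min_samples-th neighbour and read eps off the knee of the curve.

# Sketch: k-distance heuristic for DBSCAN's eps
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=4).fit(df_norm)   # k matches min_samples=4
distances, _ = nn.kneighbors(df_norm)
k_dist = np.sort(distances[:, -1])                  # distance to the 4th neighbour
# Plot k_dist (e.g. plt.plot(k_dist)) and take eps near the curve's knee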
RFM Implementation:
df['Frequency'] = (df['NumWebPurchases'] + df['NumCatalogPurchases']
                   + df['NumStorePurchases'])
df['Monetary'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts']
                  + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

# Data distribution
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
# (distribution figures of Recency, Frequency and Monetary are rendered here)
# RFM Mapping: 3-D scatter of the customers in R-F-M space
fig = px.scatter_3d(df_rfm_scaled, x='Recency', y='Frequency', z='Monetary',
                    title='<b>RFM Mapping</b>',
                    opacity=0.5, color='Monetary', color_continuous_scale='electric')
fig.update_traces(marker=dict(size=5))
fig.show()
# (line-width and marker styling plus fig.show() calls for the individual
#  Recency, Frequency and Monetary distribution figures)
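The scaled frame df_rfm_scaled used in the mapping above is not constructed in this excerpt; a minimal sketch, assuming df already carries the three RFM columns:

# Sketch: construct the scaled RFM frame consumed by the 3-D mapping
import pandas as pd
from sklearn.preprocessing import StandardScaler

rfm_cols = ['Recency', 'Frequency', 'Monetary']
df_rfm_scaled = pd.DataFrame(StandardScaler().fit_transform(df[rfm_cols]),
                             columns=rfm_cols, index=df.index)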
for i in Personal:
    fig = px.scatter(data_frame=df, x=i, y="Spend", color="Cluster",
                     marginal_y="violin", marginal_x="box",
                     color_discrete_sequence=px.colors.qualitative.T10)
    fig.update_layout(title=i.title() + " vs Spending",
                      xaxis_title=i.title(),
                      yaxis_title="Total Expense",
                      paper_bgcolor='rgb(229, 236, 246)',
                      plot_bgcolor='rgb(229, 236, 246)',
                      width=800,
                      height=600,
                      template="simple_white")
    fig.show()
for i in Personal:
    fig = px.scatter(data_frame=df, x=i, y="Spend", color="Clusters",
                     marginal_y="violin", marginal_x="box",
                     color_discrete_sequence=colors_cluster,
                     category_orders=dict(Clusters=[0, 1, 2, 3]))
    fig.update_layout(title=i.title() + " vs Spending",
                      xaxis_title=i.title(),
                      paper_bgcolor='rgb(229, 236, 246)',
                      plot_bgcolor='rgb(229, 236, 246)',
                      template="simple_white")
    fig.show()
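The profiling loops above assume df carries the cluster-label columns ('Cluster' and 'Clusters') and that Personal is a list of demographic columns, all defined earlier in the notebook; a minimal sketch of the assumed wiring:

# Sketch: attach the labels consumed by the profiling loops (names assumed)
df['Cluster'] = kmeans_preds       # two-segment labels from the k-means fit above
# df['Clusters'] would hold the four RFM-based segments from Chapter 4
# Personal = [...]                 # demographic columns, defined earlier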
GUI Implementation:
import sys
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from matplotlib.figure import Figure
from matplotlib.lines import Line2D
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3-D projection)
from PyQt5.QtCore import Qt
from PyQt5.QtGui import QIcon, QColor, QPalette
from PyQt5.QtWidgets import (QApplication, QWidget, QLabel, QLineEdit,
                             QPushButton, QGridLayout, QHBoxLayout,
                             QTableWidget, QTableWidgetItem, QMessageBox)

data = pd.read_csv('marketing_campaign.csv', delimiter='\t')
numeric_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines',
                'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
                'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
                'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']
data_numeric = data[numeric_cols]
X = StandardScaler().fit_transform(data_numeric)
class CustomerSegmentation(QWidget):
    def __init__(self):
        super().__init__()
        self.data = pd.read_csv('marketing_campaign.csv', delimiter='\t')
        self.setWindowTitle('Customer Segmentation')
        self.setGeometry(100, 100, 800, 600)

        header = QLabel('RFM')
        header.setStyleSheet('font-size: 36px; font-weight: bold; color: #fff;')
        subheader = QLabel('Customer Segmentation')
        subheader.setStyleSheet('font-size: 18px; color: #666; margin-bottom: 20px;')

        layout = QGridLayout()
        layout.setContentsMargins(20, 20, 20, 20)
        layout.setSpacing(10)
        layout.addWidget(header, 0, 0, 1, 2, Qt.AlignCenter)
        layout.addWidget(subheader, 1, 0, 1, 2, Qt.AlignCenter)

        recency_label = QLabel("Recency:")
        recency_icon = QIcon('recency.png')  # icon file name assumed
        self.recency_input = QLineEdit()
        self.recency_input.setPlaceholderText("Days Since Last Purchase")
        self.recency_input.setStyleSheet('font-size: 14px; color: #666;')
        recency_label.setPixmap(recency_icon.pixmap(24, 24))
        layout.addWidget(recency_label, 2, 0)
        layout.addWidget(self.recency_input, 2, 1)

        frequency_label = QLabel("Frequency:")
        frequency_icon = QIcon('frequency.png')  # icon file name assumed
        self.frequency_input = QLineEdit()
        self.frequency_input.setPlaceholderText("Number of Purchases")
        self.frequency_input.setStyleSheet('font-size: 14px; color: #666;')
        frequency_label.setPixmap(frequency_icon.pixmap(24, 24))
        layout.addWidget(frequency_label, 3, 0)
        layout.addWidget(self.frequency_input, 3, 1)

        # Monetary row, referenced by predict(), clear_inputs() and setBuddy() below
        monetary_label = QLabel("Monetary:")
        monetary_icon = QIcon('monetary.png')  # icon file name assumed
        self.monetary_input = QLineEdit()
        self.monetary_input.setPlaceholderText("Total Amount Spent")
        self.monetary_input.setStyleSheet('font-size: 14px; color: #666;')
        monetary_label.setPixmap(monetary_icon.pixmap(24, 24))
        layout.addWidget(monetary_label, 4, 0)
        layout.addWidget(self.monetary_input, 4, 1)
        button_layout = QHBoxLayout()
        self.add_button = QPushButton('Add')
        self.add_button.setIcon(QIcon('add.png'))  # icon file name assumed
        self.add_button.setToolTip('Click to add a new customer')
        self.add_button.setStyleSheet('font-size: 14px; color: #fff; '
                                      'background-color: #3b9cff; border: none; padding: 10px 20px;')
        self.add_button.setFixedSize(120, 50)
        self.add_button.clicked.connect(self.add_customer)
        button_layout.addWidget(self.add_button)

        self.del_button = QPushButton('Del')
        self.del_button.setIcon(QIcon('del.png'))  # icon file name assumed
        self.del_button.setToolTip('Click to delete the selected customer')
        self.del_button.setStyleSheet('font-size: 14px; color: #fff; '
                                      'background-color: #3b9cff; border: none; padding: 10px 20px;')
        self.del_button.setFixedSize(120, 50)
        self.del_button.clicked.connect(self.del_customer)
        button_layout.addWidget(self.del_button)

        self.clear_button = QPushButton('Clear')
        self.clear_button.setIcon(QIcon('clear.png'))  # icon file name assumed
        self.clear_button.setToolTip('Click to clear the input fields')
        self.clear_button.setStyleSheet('font-size: 14px; color: #fff; '
                                        'background-color: #3b9cff; border: none; padding: 10px 20px;')
        self.clear_button.setFixedSize(120, 50)
        self.clear_button.clicked.connect(self.clear_inputs)
        button_layout.addWidget(self.clear_button)

        self.predict_button = QPushButton('Predict')
        self.predict_button.setIcon(QIcon('predict.png'))  # icon file name assumed
        self.predict_button.setToolTip('Click to predict the customer segment')
        self.predict_button.setStyleSheet('font-size: 14px; color: #fff; '
                                          'background-color: #3b9cff; border: none; padding: 10px 20px;')
        self.predict_button.setFixedSize(120, 50)
        self.predict_button.clicked.connect(self.predict)
        button_layout.addWidget(self.predict_button)
        layout.addLayout(button_layout, 5, 0, 1, 2, Qt.AlignCenter)

        table_button_layout = QHBoxLayout()
        self.table = QTableWidget()

        # Derive the RFM features before dropping the raw columns, so the table
        # keeps ID, Recency, Frequency and Monetary
        self.data['Frequency'] = (self.data['NumWebPurchases']
                                  + self.data['NumCatalogPurchases']
                                  + self.data['NumStorePurchases'])
        self.data['Monetary'] = (self.data['MntWines'] + self.data['MntFruits']
                                 + self.data['MntMeatProducts'] + self.data['MntFishProducts']
                                 + self.data['MntSweetProducts'] + self.data['MntGoldProds'])
        columns_to_drop = ['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
                           'Teenhome', 'Dt_Customer', 'MntWines', 'MntFruits',
                           'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
                           'MntGoldProds', 'NumWebPurchases', 'NumCatalogPurchases',
                           'NumStorePurchases', 'NumDealsPurchases', 'NumWebVisitsMonth',
                           'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4',
                           'AcceptedCmp5', 'Complain', 'Response', 'Z_CostContact', 'Z_Revenue']
        self.data.drop(columns=columns_to_drop, inplace=True)

        self.table.setRowCount(len(self.data))
        self.table.setColumnCount(len(self.data.columns))
        # Set the header labels of the table to match the column names of the data
        self.table.setHorizontalHeaderLabels(list(self.data.columns))
        self.table.setSelectionBehavior(QTableWidget.SelectRows)    # assumed setting
        self.table.setSelectionMode(QTableWidget.SingleSelection)   # assumed setting
        self.table.setEditTriggers(QTableWidget.NoEditTriggers)     # assumed setting
        self.table.itemSelectionChanged.connect(self.select_customer)
        self.table.setStyleSheet('font-size: 14px;')
        self.table.setMinimumSize(600, 400)
        table_button_layout.addWidget(self.table)
        layout.addLayout(table_button_layout, 6, 0, 1, 2)

        for i in range(len(self.data)):
            row_items = []
            for col_name in ['ID', 'Recency', 'Frequency', 'Monetary']:
                col_index = self.data.columns.get_loc(col_name)
                row_items.append(int(float(self.data.iloc[i, col_index])))
            for j, item in enumerate(row_items):
                self.table.setItem(i, j, QTableWidgetItem(str(item)))
        result_layout = QHBoxLayout()
        self.result_label = QLabel("Result:")
        self.result_label.setStyleSheet('font-size: 14px; color: #fff;')
        result_layout.addWidget(self.result_label)
        layout.addLayout(result_layout, 8, 0, 1, 2)

        self.rfm_row = np.array([[0, 0, 0]])  # placeholder R, F, M row (attribute name assumed)
        self.setLayout(layout)

        recency_label.setBuddy(self.recency_input)
        frequency_label.setBuddy(self.frequency_input)
        monetary_label.setBuddy(self.monetary_input)

        # Dark 'Fusion' theme
        QApplication.setStyle('Fusion')
        palette = QPalette()
        palette.setColor(QPalette.Window, QColor(53, 53, 53))
        palette.setColor(QPalette.WindowText, Qt.white)
        palette.setColor(QPalette.Base, QColor(25, 25, 25))
        palette.setColor(QPalette.AlternateBase, QColor(53, 53, 53))
        palette.setColor(QPalette.ToolTipBase, Qt.white)
        palette.setColor(QPalette.ToolTipText, Qt.white)
        palette.setColor(QPalette.Text, Qt.white)
        palette.setColor(QPalette.Button, QColor(53, 53, 53))
        palette.setColor(QPalette.ButtonText, Qt.white)
        palette.setColor(QPalette.BrightText, Qt.red)
        palette.setColor(QPalette.Link, QColor(42, 130, 218))
        palette.setColor(QPalette.Highlight, QColor(42, 130, 218))
        palette.setColor(QPalette.HighlightedText, Qt.black)
        self.setPalette(palette)
    def select_customer(self):
        # Load the selected row's values into the input fields
        # (method is connected above; body reconstructed)
        row = self.table.currentRow()
        if row >= 0:
            self.recency_input.setText(self.table.item(row, 1).text())
            self.frequency_input.setText(self.table.item(row, 2).text())
            self.monetary_input.setText(self.table.item(row, 3).text())

    def add_customer(self):
        try:
            recency = int(self.recency_input.text())
            frequency = int(self.frequency_input.text())
            monetary = int(self.monetary_input.text())
        except ValueError:
            QMessageBox.warning(self, 'Invalid Input',
                                'Please enter valid numeric values for '
                                'recency, frequency, and monetary.')
            return
        new_id = int(self.data['ID'].max()) + 1
        new_customer = pd.DataFrame({
            'ID': [new_id],
            'Recency': [recency],
            'Frequency': [frequency],
            'Monetary': [monetary]
        })
        row_num = self.table.rowCount()
        self.table.insertRow(row_num)
        for i, value in enumerate([new_id, recency, frequency, monetary]):
            self.table.setItem(row_num, i, QTableWidgetItem(str(value)))
        self.data = pd.concat([self.data.iloc[:row_num], new_customer,
                               self.data.iloc[row_num:]], ignore_index=True)
    def del_customer(self):
        # Get the selected row index
        row_index = self.table.currentRow()
        if row_index >= 0:
            self.table.removeRow(row_index)
        else:
            QMessageBox.warning(self, 'Error', 'No customer selected.')
    def predict(self):
        data['Frequency'] = (data['NumWebPurchases'] + data['NumCatalogPurchases']
                             + data['NumStorePurchases'])
        data['Monetary'] = (data['MntWines'] + data['MntFruits'] + data['MntMeatProducts']
                            + data['MntFishProducts'] + data['MntSweetProducts']
                            + data['MntGoldProds'])
        recency_str = self.recency_input.text().strip()
        frequency_str = self.frequency_input.text().strip()
        monetary_str = self.monetary_input.text().strip()
        try:
            rfm = [[int(recency_str), int(frequency_str), int(monetary_str)]]
        except ValueError:
            QMessageBox.warning(self, 'Invalid Input',
                                'Please enter valid numeric values first.')
            return
        # Fit MiniBatch k-means on the RFM features and assign the new customer
        # to a cluster (fitting step reconstructed; it matches the MiniBatch
        # k-means approach described for the GUI)
        X_rfm = data[['Recency', 'Frequency', 'Monetary']]
        kmeans = MiniBatchKMeans(n_clusters=4, batch_size=100, random_state=42).fit(X_rfm)
        cluster_label = int(kmeans.predict(rfm)[0])
        customer_segments = {
            0: "Low-Spending Active Customers",
            1: "Mid-Spending Active Customers",
            2: "High-Spending Active Customers",
            3: "Churned or Inactive Customers"
        }
        segment = customer_segments.get(cluster_label)
        self.result_label.setText("Cluster {}: {}".format(cluster_label, segment))
    def plot(self):
        # Assumes predict() has already derived Frequency and Monetary on data
        X = data[['Recency', 'Frequency', 'Monetary']]
        kmeans = MiniBatchKMeans(n_clusters=4, batch_size=100, random_state=42)
        kmeans.partial_fit(X)
        fig = Figure()
        ax = fig.add_subplot(111, projection='3d')
        colors = ['#3C1642', '#086375', '#1DD3B0', '#AFFC41']
        legend_elements = [
            Line2D([0], [0], marker='o', color='w', label='Low-Spending Active',
                   markerfacecolor=colors[0], markersize=10),
            Line2D([0], [0], marker='o', color='w', label='Mid-Spending Active',
                   markerfacecolor=colors[1], markersize=10),
            Line2D([0], [0], marker='o', color='w', label='High-Spending Active',
                   markerfacecolor=colors[2], markersize=10),
            Line2D([0], [0], marker='o', color='w', label='Churned or Inactive',
                   markerfacecolor=colors[3], markersize=10),
        ]
        ax.legend(handles=legend_elements, loc='upper right')
        for i in range(4):
            cluster_points = X[kmeans.labels_ == i]
            ax.scatter(cluster_points['Recency'], cluster_points['Frequency'],
                       cluster_points['Monetary'],
                       c=colors[i], label=f'Cluster {i}', s=50, alpha=0.5)
    def show_plot(self):
        # PlotWindow: helper window wrapping plot(); class name assumed, definition not shown
        self.plot_window = PlotWindow()
        self.plot_window.show()
    def clear_inputs(self):
        self.recency_input.clear()
        self.frequency_input.clear()
        self.monetary_input.clear()

if __name__ == '__main__':
    app = QApplication(sys.argv)
    segmentation = CustomerSegmentation()
    segmentation.show()
    sys.exit(app.exec_())
1. WORK DONE
Reviewed the previous work.
2. WORK TO BE DONE
Migrate Chapters 1 and 2 into the new FYP2 template.
3. PROBLEMS ENCOUNTERED
The previous source code contained several bugs and errors.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
Learnt how to determine the k value.
2. WORK TO BE DONE
Explore all the clustering algorithms.
3. PROBLEMS ENCOUNTERED
There are various competing methods for finding the optimal k.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
Determined the optimal k = 4 and implemented the k-means clustering algorithm.
2. WORK TO BE DONE
Implement GMM and DBSCAN.
3. PROBLEMS ENCOUNTERED
Different optimal k values (2 and 4) were obtained when different metrics were used.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
Implemented k-means, GMM, and DBSCAN.
2. WORK TO BE DONE
Perform a summary comparison of the performance metrics to determine k.
3. PROBLEMS ENCOUNTERED
The epsilon parameter of DBSCAN is difficult to determine.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
A pivot table of the summary metrics was implemented; the optimal k was 2, and the
clustering algorithms other than DBSCAN were performed.
2. WORK TO BE DONE
Use GridSearchCV() to determine the epsilon of DBSCAN.
3. PROBLEMS ENCOUNTERED
The hyperparameters of DBSCAN are difficult to determine.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
The three algorithms (k-means, GMM, and DBSCAN) were successfully implemented, their
results were plotted in 2D and 3D, and the clustering algorithms were compared.
2. WORK TO BE DONE
Learn a new approach, RFM analysis.
3. PROBLEMS ENCOUNTERED
It is difficult to categorize the customer groups with only two clusters.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
Plotted the RFM distributions and performed RFM modelling.
2. WORK TO BE DONE
Cluster analysis to categorize the cluster groups.
3. PROBLEMS ENCOUNTERED
Obtained two candidate k values (2 and 4) and confirmed k = 4, as the behaviour-based
clusters are easier to identify.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
Completed the cluster analysis, RFM modelling, and categorization.
2. WORK TO BE DONE
GUI development using Incremental k-means.
3. PROBLEMS ENCOUNTERED
Difficulty naming the categorized customer groups.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
A GUI using MiniBatch k-means was developed, with input fields for integer recency,
frequency, and monetary values.
2. WORK TO BE DONE
Add CRUD functions to the GUI.
3. PROBLEMS ENCOUNTERED
The incremental k-means module could not be used due to a scikit-learn version problem
on my PC, so I replaced it with MiniBatch k-means.
_________________________ _________________________
Supervisor’s signature Student’s signature
1. WORK DONE
A user-friendly GUI was developed.
2. WORK TO BE DONE
Report writing from Chapter 3 onwards.
3. PROBLEMS ENCOUNTERED
Unsure about the report organization.
_________________________ _________________________
Supervisor’s signature Student’s signature
Similarity by source
Internet Sources: 8 %
Publications: 6 %
Student Papers: 12 %
Note: Supervisor/Candidate(s) is/are required to provide a softcopy of the full set of
the originality report to Faculty/Institute.
Based on the above results, I hereby declare that I am satisfied with the originality of the Final
Year Project Report submitted by my student(s) as named above.
______________________________ ______________________________
Signature of Supervisor Signature of Co-Supervisor
*Include this form (checklist) in the thesis (Bind together as the last page)
______________________
(TOH WEI XUAN)
Date: 25/4/2023